mainframe on a desktop for individual engineers and scientists that may give major impetus to the use of
VLSI Technology Packs 32-Bit Computer
System into a Small Package

The new HP 9000 Computer is a compact, highly capable
32-bit computer system that incorporates five very dense
integrated circuits made by a highly refined NMOS process.

by Joseph W. Beyers, Eugene R. Zeller, and S. Dana Seccombe



HOW DOES ONE GO ABOUT PACKING the power of
a large mainframe computer into a desktop computer?
Answering this question was only one of the
many problems facing the HP design team given the
assignment of developing a personal engineering design
station with enough computing power to allow the entire
design process to take place on an engineer's bench. Their
answer is a fully integrated 32-bit processing system
based on five custom VLSI circuits. This required the
development of three key technologies:
■ A 32-bit system architecture realized by using advanced
  circuit design techniques
■ A state-of-the-art NMOS VLSI* process optimized for
  density and performance
■ A new circuit board to dissipate the heat generated by the
  VLSI circuits and allow high-speed signal propagation.

System Overview 

A block diagram of the 32-bit processing system is shown
in Fig. 1. The system uses five different NMOS circuits
operating at 18 MHz. These chips include a 32-bit CPU, an
I/O processor, a memory controller, a 128K-bit RAM, and a
clock driver (Fig. 2).

CPU. This single-chip 32-bit processor contains 450,000
transistors.¹ It is microprogrammed and has 9K 38-bit
words of resident control store. It has twenty-eight 32-bit
registers, a 32-bit ALU (arithmetic/logic unit) with multiply
and divide logic, an N-bit shifter for bit extraction and
alignment, and a seven-register port to the memory processor
bus. The stack-oriented instruction set contains
floating-point, string, and compiler optimization instructions.
A 32-bit load instruction (including complete bounds
checking) takes 550 ns and a 64-bit floating-point multiply 
takes (3 ^s. Microinstructions can execute in 55 ns, 
I/O Processor (IOP). The IOP is also microprogrammed and
contains 4.5K 38-bit words of control store. It handles eight
DMA (direct memory access) channels with a data rate of up
to 5M bytes/s. It has sixteen software-programmable interrupt
levels and can independently execute command sequences
from memory.

Memory Controller. The memory controller chip can control
256K bytes of RAM, perform byte, half-word, word, and
semaphore** operations, and do single-bit error correction
and double-bit error detection of memory without any
performance penalty. It can also heal up to 32 faulty locations
and map logical to physical addresses in 16K-byte blocks.
RAM. The 16K×8-bit RAM chip contains 128K bits of random
access memory organized with redundant rows and
columns. It is pipelined and has a 165-ns access time and a
110-ns cycle time.

Clock. The clock chip generates two nonoverlapping
18-MHz clock signals from a 36-MHz sine wave. It can drive
a 1500-pF load with a 6-ns rise time.

*N-channel metal-oxide semiconductor, very large-scale integration.
**Used to control accesses in a multiple-processor system.

Memory Processor Bus 

The CPU, IOP, and memory controller communicate via
the memory processor bus (MPB). The protocol of this
44-line, 36M-byte/s bus can support up to seven CPUs or IOPs
and fifteen memory controllers. This precharged dynamic
bus is multiplexed between 29-bit addresses and 32-bit data
words on alternating 55-ns clock cycles. The memory accesses
are pipelined to allow sending up to two new addresses
while the first data word is fetched from memory.
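
The 36M-byte/s figure can be checked from the numbers in the
paragraph above. The following Python lines are only a sketch:
they assume that one 32-bit data word moves on every second
bus cycle (because address and data alternate) and that the
55-ns cycle is the rounded period of the 18-MHz clock. The
function name is illustrative and does not come from the article.

```python
CLOCK_HZ = 18_000_000      # system clock from the text (one cycle is about 55 ns)
DATA_BYTES_PER_WORD = 4    # one 32-bit data word
CYCLES_PER_WORD = 2        # assumption: address cycle + data cycle per word


def peak_bytes_per_second(clock_hz=CLOCK_HZ):
    """Peak data rate if every other bus cycle carries a 32-bit word."""
    words_per_second = clock_hz // CYCLES_PER_WORD
    return words_per_second * DATA_BYTES_PER_WORD


print(peak_bytes_per_second())  # 36000000
```

Under these assumptions the bus delivers 36 million bytes per
second, which agrees with the rate quoted for the MPB.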

Because the 36M-byte/s data rate far exceeds the data
requirements of one CPU, additional CPUs can be added
and/or independent IOP operations can occur without a
significant reduction in performance. Fig. 3 shows relative
system performance as CPUs are added to the bus.
Computation-intensive examples tend to approach the
"ideal" line while heavy string operation performance
tends to be lower than the average. The 32-bit CPUs are
designed so that additional CPUs can be transparently
added to the system. New tasks are usually assigned to
whichever CPU on the bus is free. However, a CPU can be
dedicated to specific tasks. For example, the I/O processors
can be programmed to send interrupt requests to either a
specific CPU or all CPUs.

Fig. 1. Block diagram of 32-bit computer system based on
five VLSI circuits: CPU, IOP, clock, memory controller, and
RAM.

Packaging 

Fig. 4 shows a picture of how the above system is packaged
for the HP 9000 product line. The package, called the
Memory/Processor Module, can hold up to twelve circuit
boards. This allows a system configuration of up to 2.5
megabytes of memory with one CPU and one IOP. Up to
three CPUs and three IOPs can be used for increased
performance by sacrificing some of the memory. Power is
supplied through two connectors on the bottom of the
package. In the worst-case configuration, the system dissipates
185 watts. Forced air flow is used to cool the VLSI
circuits to below a worst-case junction temperature of 90°C.

The standard I/O bus exits the package through a slot in 
the bottom and two optional I/O buses exit through connec- 
tors on the module's door. 

The VLSI chips are mounted on "finstrates." This name
was coined from this circuit board's dual role as a cooling
fin and chip substrate. The Teflon dielectric covering
provides for low-capacitance interconnections and the
finstrate's copper core spreads the heat away from the
chips. The three types of finstrates used in the module are
shown in Fig. 5. In the center is the CPU board, which
contains the CPU chip and a clock chip. On the right finstrate
are an IOP chip and a clock chip. The inset on
the upper right side of this IOP finstrate is where a small
printed circuit board containing a set of TTL buffers for
driving the I/O bus is attached. The memory finstrate on the
left contains a memory controller, a clock, and twenty
128K-bit RAM chips to provide 256K bytes of
single-bit-error-correcting memory with a 36M-byte/s data rate.

Fig. 2. Microphotographs of the five IC chips for the HP 9000
Computer. (a) CPU (9.2x). (b) I/O processor (9.2x). (c) Memory
controller (9.2x). (d) 128K-bit RAM (9x). (e) Clock (15x).
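
The board and chip counts given in the text can be tied
together with a little arithmetic. The sketch below assumes
that every one of the twelve slots holds exactly one CPU, IOP,
or memory finstrate and counts only the VLSI chips named
above (the TTL buffers and the master clock card are left out).
The function names are illustrative only.

```python
SLOTS = 12                        # finstrate slots in the Memory/Processor Module
KBYTES_PER_MEMORY_BOARD = 256     # memory per memory finstrate
CHIPS_PER_BOARD = {
    "cpu": 2,       # CPU chip + clock chip
    "iop": 2,       # IOP chip + clock chip
    "memory": 22,   # memory controller + clock + twenty RAM chips
}


def max_memory_kbytes(cpus, iops, slots=SLOTS):
    """Memory available when all remaining slots hold memory finstrates."""
    memory_boards = slots - cpus - iops
    if cpus < 1 or iops < 1 or memory_boards < 1:
        raise ValueError("need at least one CPU, one IOP, and one memory board")
    return memory_boards * KBYTES_PER_MEMORY_BOARD


def vlsi_chip_count(cpus, iops, memory_boards):
    """Number of VLSI chips for a given mix of finstrates."""
    return (cpus * CHIPS_PER_BOARD["cpu"]
            + iops * CHIPS_PER_BOARD["iop"]
            + memory_boards * CHIPS_PER_BOARD["memory"])


print(max_memory_kbytes(1, 1))    # 2560 (2.5 megabytes)
print(max_memory_kbytes(3, 3))    # 1536 (1.5 megabytes)
print(vlsi_chip_count(1, 1, 4))   # 92 chips for a one-megabyte system
```

With one CPU and one IOP the ten remaining slots give the
2.5 megabytes quoted for the module, and a one-megabyte
system (four memory finstrates) comes to the 92 chips cited
in the reliability discussion.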

The high-speed MPB exists only on the edges of the 
finstrates and the module's motherboard. The Memory/ 
Processor Module also contains a small printed circuit card 
that generates a 36-MHz master clock sine wave driven to 
each of the twelve finstrate slots. 

The system is mechanically self-contained. Electromagnetic
interference (EMI) is suppressed by using special
filters on the power supply connectors, honeycomb air
filters at both ends, conductive door gaskets, and shielded
I/O cables.

Fig. 3. Multiple 32-bit CPU performance for Whetstone B1D
benchmark.

HEWLETT-PACKARD JOURNAL AUGUST 1983

© Copr. 1949-1998 Hewlett-Packard Co.

VLSI NMOS Process 

The major process design goal was to develop a
high-density, high-performance, highly reliable,
production-volume VLSI process. These goals were realized by the use
of a modified n-channel silicon-gate MOS process featuring
3½ levels of interconnect: diffusion, polysilicon with
buried contacts, and two levels of refractory metal.²

Integrated circuit densities are determined primarily by
the minimum feature size. Lithographic considerations set
this limit and resulted in layout rules and process capabilities
that enable transistors to be fabricated with a
minimum pitch of 2.5 µm (1.5-µm-wide lines spaced 1.0
µm apart). The unconventional contact-over-gate device
structure allowed even tighter layout rules.

Another technique used to achieve the high circuit density
is the use of two interconnect levels of refractory metal.
Tungsten metallization was chosen because of its high
conductivity and its resistance to electromigration.³

In addition to high density, transistor performance was
emphasized. Special transistor characteristics (e.g.,
gate-to-drain overlap capacitance and threshold voltage versus
backgate voltage dependence) required the use of
self-aligned gates, shallow source and drain regions, and
transistor threshold voltage implants.

Reliability and Self-Test

Besides maximum performance, high reliability and easy
serviceability were also key design goals. These goals were
achieved through several approaches. First, the NMOS process
was designed for high reliability: silicon gates, refractory
metal (no metal migration problems), and conservative
design specifications that protect against gate hot-electron
injection. Second, except for the I/O drivers and clock circuit,
the system is fully integrated: 92 integrated circuit
chips for a system with one megabyte of memory. Third,
special reliability features such as single-bit error correction
and double-bit error detection were incorporated into
the system's architecture. Up to 32 faulty locations per
memory finstrate can be healed by redirecting their contents
to registers on the memory controller chip, and memory
size can be degraded in 16K-byte blocks (the address
space can be assigned arbitrarily within or between memory
cards in 16K-byte increments). Any card can be easily
removed and the system will still operate, assuming, of
course, that there are still at least one CPU, one I/O processor,
and one memory card left in the system.

Fig. 4. This package, called the Memory/Processor Module, can
contain a complete 32-bit computer system with 2.5 megabytes of
RAM. Performance can be enhanced by exchanging some of
the memory boards for additional CPU and/or I/O processor boards.

Fig. 5. To dissipate the heat generated by the very dense VLSI
chips, special boards, called finstrates, were developed. Shown
from left to right are the 128K-byte RAM, CPU, and IOP finstrates.
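
The article does not say which code the memory controller
uses for single-bit error correction and double-bit error
detection. As an illustration only, the sketch below uses a
textbook extended Hamming code, which is one standard way
to obtain that behavior. For 32 data bits it needs 7 check bits
(39 bits per stored word), which would be consistent with
twenty 16K×8-bit RAM chips yielding 256K bytes of usable
memory, but that link is an inference, not a statement from the
source. All names here are hypothetical.

```python
def parity_bit_count(n_data):
    """Smallest r with 2**r >= n_data + r + 1 (single-error-correcting bound)."""
    r = 0
    while (1 << r) < n_data + r + 1:
        r += 1
    return r


def encode(data_bits):
    """Return a codeword list. Index 0 is the overall parity bit; indices
    1..n use the classic Hamming layout (powers of two hold check bits)."""
    r = parity_bit_count(len(data_bits))
    n = len(data_bits) + r
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2               # make the total number of ones even
    return code


def decode(code):
    """Return (status, data_bits). Status is 'ok', 'corrected', 'double',
    or 'uncorrectable'; data_bits is None when no data can be trusted."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    if sum(code) % 2:                     # odd parity: treat as a single error
        if syndrome > n:
            return "uncorrectable", None
        code[syndrome] ^= 1               # syndrome 0 is the overall parity bit
        status = "corrected"
    elif syndrome:
        return "double", None             # even parity but nonzero syndrome
    else:
        status = "ok"
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


word = [(0xDEADBEEF >> i) & 1 for i in range(32)]
stored = encode(word)
stored[17] ^= 1                           # simulate one faulty RAM bit
print(len(stored), decode(stored)[0])     # 39 corrected
```

Any single flipped bit is repaired on readback, while two
flipped bits are reported rather than silently miscorrected.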

In addition, the thorough internal self-test can quickly
identify any bad cards. At power-on, each card is tested
without the need for external software and if a fault is
detected, appropriate LEDs (light-emitting diodes) are lit to
indicate which card is defective.



References 

1. J. Beyers, et al, "A 32b VLSI Chip," Digest of Technical Papers,
1981 IEEE International Solid-State Circuits Conference, THAM
9.1.

2. J. Mikkelson, et al, "An NMOS VLSI Process for Fabrication of a
32b CPU Chip," Digest of Technical Papers, 1981 IEEE International
Solid-State Circuits Conference, THAM 9.2.

3. P.P. Merchant, "Electromigration: An Overview," Hewlett-Packard
Journal, Vol. 33, no. 8, August 1982.



Acknowledgments 



Bringing these complex technologies to production in late 1982 was the result of the
determination and dedication of many people. Listed below are key contributors who
transformed the initial design goals into a production reality.

The CPU chip design team included Joe Beyers, Kevin Burkhart, Dave Conner, Mark
Forsyth, Mark Hammer, Tony Riccio, Harlan Talley, and Darius Tanksalvala. The CPU
microcode was written by Jim Fiasconaro, Lee Gregory, Mike Kolesar, Bill Kwinn,
Donovan Nickel, Rand Renfroe, and Larry Rupp. Fred Gross and Ed Weber wrote the
IOP microcode and Mark Canepa, Ken Holloway, Bill Jaffe, Rich Kochis, Dave Maitland,
Gary Taylor, and Don Weiss designed the IOP chip. The memory controller chip was
developed by Joe Fucetola, Cliff Lob, Mark Ludwig, Bill Olson, Mark Reed, Tom Walley,
and Jeff Yetter. Alexander Elkins designed the clock chip and Dale Beucler, Doug
DeBoer, Lou Dohse, Charlie Kohlhardt, John Spencer, Bill Terrell, and John Wheeler
designed the 128K-bit RAM chip.

Hai Vo-Ba was responsible for the layout of the finstrates. The Memory/Processor
Module's mechanical design and the internal boards were designed by Madi Bowen,
Jerry Kaufman, John Moffatt, Steven Shands, Gary Taylor, and Guy Wagner. Craig
Mortensen and Ed Weber developed chip design tools which were run by the computer
operators: Bev Raines, Binh Rybacki, and Kathy Schraeder. Special test hardware and
software were developed by Brady Barnes, Richard Butler, Doug Fogg, Bob Miller, and
Walt Nester.

The NMOS process development team included Rod Alley, Jim Barnes, Jeff Brooks,
Doug Crook, Fung-Sun Fei, Barry Fernelius, Dave Forgerson, Tony Gaddis, Larry Hall,
Norm Hendrickson, Ulrich Hess, Gary Hong, Dan Kessler, Rajendra Kumar, Fred
LaMaster, Zemen Lebne-Dengel, Rick Luebs, Bob Manley, Carol McConica, Jim
Mikkelson, John Moffatt, Ken Monnig, Don Novy, David Quint, Jim Roland, Dana
Seccombe, Jodi Riedinger Smith, Paul Uhm, and Gene Zeller.

The photolithographic technology was developed by Howard Abraham, Skip Augustine,
Keith Bartlett, Gary Hillis, J. L. Marsh, Rob Slutz, Mark Stolz, Rick Tsai, and Marty
Wilson. Dave Allen, Kevin Funk, and Glen Leinbach were responsible for the chip
assembly process and the finstrate process was developed by Rick Euker, Deri Pratt,
and Jeff Straw. The reliability of the chips and the system was the responsibility of David
Leary, Arun Malhotra, and Henry Schauer.

This HP Journal issue focuses on the R&D portion of the technology development.
However, the successful fabrication of VLSI chips in volume is equally determined by
the manufacturing organization that supports it. We would like to thank our manufacturing
organization for their enthusiastic support, especially Ray Oatz^n%, Cliff Doyle, Jim
Diehle, Gary Egan, Jerry Harmon, and John Mahoney.

Special recognition and thanks go to our secretaries Carol Miller and Lavonne
Gardner.

HP's Cupertino Integrated Circuits Operation and Hewlett-Packard Laboratories
helped us solve some of our process development problems. In addition, special
recognition should go to the key managers who continually supported this development.
They include Jack Anderson, Doug Chance, Chris Christopher, Don Schulz, and
Fred Wenninger.






An 18-MHz, 32-Bit VLSI Microprocessor 

by Kevin P. Burkhart, Mark A. Forsyth, Mark E. Hammer, and Darius F. Tanksalvala



THE HEART OF HP's new 32-bit VLSI computer system
is the Memory/Processor Module. The central
processing unit in this module is an NMOS circuit
containing 450,000 transistors on a single chip operating at
a clock frequency of 18 MHz.¹ This compact CPU chip,
which implements a 32-bit version of the HP 3000 Computer's
stack-based architecture, is designed and microprogrammed
to support multiple-CPU operations within a
single Memory/Processor Module. Each CPU is capable of
one-MIPS (million instructions per second) performance
with very little performance degradation in multiple-CPU
configurations.

Chip Organization 

Fig. 1 shows the layout of the major functional components
on the CPU chip. The data path area containing
the ALU, register stack, and memory processor bus (MPB)
interface is devoted to user- and system-level information
processing. Two data buses within the data path link the
ALU and the general-purpose register stack.
Register Stack. The register stack contains 31 registers used
for machine instruction handling, general-purpose data
storage, system addressing, and system status. Three registers
are devoted to the machine instruction pipeline where
special logic is included to predecode opcodes. Several
registers in the stack hold the base and limit addresses for
the data and program stacks in memory. Circuits are included
to select the appropriate address base register
automatically when address offsets are computed.

Four registers locally store the top values in the current
data stack to allow fast access to often-used operands.
Special-purpose hardware monitors one data bus for certain
conditions such as zero, positive, and negative, and drives
branch qualifier lines to the test-condition multiplexer. This
data bus connects the register stack to the MPB interface,
which handles all data transfers between the CPU and
the memory and other processors.

Fig. 1. Outline of the 32-bit CPU chip indicating major sections
of the chip's architecture.

Since the MPB interface has its own data registers and
control logic, internal CPU processes can initiate transfers
and continue operation while the interface handles the
MPB's complex synchronous protocol. The interface has
dual-channel capability so that two completely different
bus transactions can be in progress simultaneously.
ALU. The arithmetic logic unit provides a wide range of
single-state, 32-bit arithmetic, logic, and shift operations.
Operands can be selected from the main data path or the
ALU's internal buses, and one of the operands can be
complemented. The shifter provides up to 31-bit right/left
arithmetic or logical shifting during one clock cycle. The
logic function unit performs the OR, AND, and XOR operations
on the operands and the adder provides their sum
with carry-out and overflow bits. Master/slave result latches
store intermediate results and return data to the register
stack buses.
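
The adder's carry-out and overflow bits can be illustrated
with a short behavioral model. This is a sketch, not a
description of the chip's circuits: the function name is
hypothetical, and the use of a complemented operand plus a
carry-in of one to form a difference is the usual
two's-complement technique, assumed here rather than taken
from the article.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, complement_b=False):
    """32-bit add returning (sum, carry_out, overflow).

    With complement_b=True the second operand is inverted and a carry-in
    of 1 is supplied, so the result is a - b in two's complement.
    """
    a &= MASK32
    b &= MASK32
    carry_in = 0
    if complement_b:
        b = ~b & MASK32
        carry_in = 1
    total = a + b + carry_in
    result = total & MASK32
    carry_out = total >> 32
    # Signed overflow: both operands share a sign that the result lacks.
    overflow = (((a ^ result) & (b ^ result)) >> 31) & 1
    return result, carry_out, overflow


print(add32(0x7FFFFFFF, 1))               # (2147483648, 0, 1)
print(add32(0xFFFFFFFF, 1))               # (0, 1, 0)
print(add32(5, 3, complement_b=True))     # (2, 1, 0)
```

The first call overflows the signed range without a carry-out,
and the second produces a carry-out without signed overflow,
which is why the two bits are reported separately.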

Sequencing Register Stack. Control circuits dominate the
center area of the CPU chip. This control area contains a
programmable logic array (PLA) microinstruction decoder,
a test-condition multiplexer, and a 14-bit sequencing register
stack which generates the 14-bit microinstruction addresses
going to the control store. Address capabilities of
the sequencing stack include short and long jumps,
subroutine jumps and returns, traps to subroutines, address
incrementing, and skips.

A mapping ROM generates microcode start addresses for
all machine instruction formats and opcodes. The CPU
machine instruction mapper includes an opcode PLA that
can be programmed to select opcodes from any combination
of bits in a 16-bit opcode. By altering the opcode PLA
programming, the CPU can be remicroprogrammed to execute
other stack architecture instruction sets. The output of
the opcode PLA is an address into one of the 256 14-bit
locations in the mapping ROM.

PLA. The central PLA decodes microinstructions and sends
control signals to the data registers, ALU, MPB interface,
and sequencing register stack. Microinstructions are divided
into fields, each field specifying control for a different
section of the CPU.

Test-Condition Multiplexer. An integral part of the PLA is
the test-condition multiplexer. This multiplexer uses one
microinstruction field to select a control qualifier from the
data path registers, ALU, or MPB interface. Conditional
microinstruction branches are taken by using the qualifier
to control the address issued by the sequencing stack.
ROM. CPU control store consists of a 9216-word ROM
organized in 32-word pages. During each clock state, the
micropage address selects one page. A word address is
issued during the following clock state to select one of the
words on this page. With this ROM design, branches such
as skips and short jumps on the current page execute without
interrupting the pipelined flow of microinstructions.
Any jump off the current page results in one NOP clock state
while the new page is selected. Microinstructions are
transferred from this ROM to the central PLA by a 38-bit bus.
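
The cost of leaving the current control store page can be
modeled in a few lines. The sketch below assumes that a
microinstruction address splits into a page number and a
word-within-page number by simple division, and that every
change of page, however it arises, costs exactly one NOP
clock state. Both points are simplifying assumptions, and the
function names are illustrative.

```python
WORDS_PER_PAGE = 32


def split_address(address):
    """Return (page, word) for a control store address."""
    return divmod(address, WORDS_PER_PAGE)


def clock_states(addresses):
    """Clock states needed to issue a sequence of microinstruction
    addresses, charging one extra NOP state whenever the page changes."""
    states = 0
    current_page = None
    for address in addresses:
        page, _word = split_address(address)
        if current_page is not None and page != current_page:
            states += 1                  # NOP while the new page is selected
        states += 1
        current_page = page
    return states


print(split_address(40))                 # (1, 8)
print(clock_states([0, 1, 2, 5]))        # 4: skip stays on page 0
print(clock_states([0, 1, 40, 41]))      # 5: jump to page 1 costs one NOP
```

In this model, microcode that keeps its short branches within
one 32-word page runs at one microinstruction per clock state.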

Typical Operation 

The execution of a machine instruction begins when it is
prefetched from memory and placed in the machine instruction
pipeline registers (see Fig. 2). As the currently
executing instruction completes, this prefetched instruction
is moved up the pipeline into the next-instruction
register and copied into the decoder-mapper in the
microsequencing hardware. Meanwhile, execution of the
immediately preceding instruction is initiated and another
instruction is prefetched. Finally, this instruction is
transferred to the current-instruction register and the
appropriate starting microcode address is issued from the
decoder-mapper. Instruction fetch, decode, and execution
are performed in parallel except when a branch occurs.

Microcode from the control store ROM implements all
machine instructions and performs the prefetch to keep the
instruction pipeline full. The fields in each microinstruction
are decoded by the central PLA, which sends control
signals to the registers and ALU to move and process data.
The MPB interface's dual-channel capability allows the
currently executing instruction to fetch data on one channel
while the instruction prefetch is in progress on the other
channel.

Data fetched from memory is stored in general-purpose
data registers in the CPU data path. Two parallel data buses
within the data path simplify the transfer of operands to the
ALU and the MPB interface. During each clock cycle, the
ALU selects its operands and then performs an arithmetic
operation and a logic or shift operation in parallel. Either or
both results can be saved in the result registers, returned to
the register stack, and/or used as operands during the next
cycle. More complex operations such as multiplication or
floating-point arithmetic are accomplished by sequences of
microcode.

Features and Performance 

The internal CPU data paths and registers, which carry
and store user data and instructions, have full 32-bit widths.
The CPU implements a stack-based architecture with a
machine instruction set consisting of 230 instructions in
16-bit and 32-bit formats. Two 32-bit buses link the 31
general-purpose registers with the ALU and the MPB interface.
The ALU has two internal 32-bit registers and three
internal buses. Typical IEEE-standard floating-point execution
time is 5.94 µs for a 64-bit addition and 10.34 µs for a
64-bit multiplication.
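
The floating-point times quoted above can be converted into
microinstruction counts. The sketch below assumes the
nominal 55-ns microinstruction time given elsewhere in the
article and works in whole nanoseconds to avoid rounding
problems. The function name is illustrative.

```python
CYCLE_NS = 55    # nominal microinstruction time from the text


def microinstruction_cycles(time_ns, cycle_ns=CYCLE_NS):
    """Return (whole cycles, leftover nanoseconds) for an execution time."""
    return divmod(time_ns, cycle_ns)


print(microinstruction_cycles(5940))     # (108, 0): 64-bit addition
print(microinstruction_cycles(10340))    # (188, 0): 64-bit multiplication
```

Both times are exact multiples of 55 ns, which suggests that
the published figures were derived from microcode cycle
counts of 108 and 188, although the article does not say so.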

The CPU sends and receives data on the MPB which links
the CPU to the other CPUs, I/O processors, and main memory
in the Memory/Processor Module. The basic data word
is 32 bits, but byte, half-word, and double-word load and
store instructions are supported within a direct
500-megabyte address range.

The microinstruction bus linking the control store ROM 
and the PLA decoder transfers one 38-bit microinstruction 
every 55 ns. The control store ROM on the CPU chip con- 
tains 350K bits divided into 288 pages. Each page contains 
32 words, each 38 bits wide. This ROM has a 70-ns access
time, which includes a 20-ns final word select time.
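
The control store dimensions quoted in this article are
mutually consistent, as a few lines of arithmetic show. The
split of the 14-bit address into a 9-bit page number and a
5-bit word number is an inference from the page and word
counts (and from the 5-bit word-select bus mentioned
elsewhere), not a statement from the source.

```python
PAGES = 288
WORDS_PER_PAGE = 32
BITS_PER_WORD = 38
ADDRESS_BITS = 14

words = PAGES * WORDS_PER_PAGE                    # total microinstruction words
bits = words * BITS_PER_WORD                      # total ROM bits
page_bits = (PAGES - 1).bit_length()              # bits needed to name a page
word_bits = (WORDS_PER_PAGE - 1).bit_length()     # bits needed to name a word

print(words, words // 1024)         # 9216 9
print(bits)                         # 350208
print(page_bits, word_bits)         # 9 5
print(words <= 2 ** ADDRESS_BITS)   # True
```

The totals match the 9216-word (9K) and 350K-bit figures, and
9 + 5 bits fills the 14-bit microinstruction address exactly.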

Special microinstruction sequencing hardware provides
addresses to the control store every 55 ns and minimizes the
use of microcode fields for address control. Conditional
jumps and subroutine calls in microcode are handled by the
sequencing hardware to off-load these tasks from the ALU's
main data path. The sequencing stack contains six registers,
three incrementers, a comparator, a 10-by-14-bit trap
ROM, a 256-by-14-bit mapping ROM, and an opcode PLA
with 16 inputs, 120 product terms, and 8 outputs. The
sequencing stack is interconnected by two 14-bit buses and
one 5-bit true-complement word-select bus.

The pipelined microinstructions are divided into seven
fields of five or six bits. Different microinstruction formats
multiplex the different fields and constants into a single
38-bit word to enhance the efficiency of microcoded
routines. These formats are decoded in the PLA based on
the opcode in the "special" field. The PLA microinstruction
decoder, which consists of 55 inputs, 508 product terms,
and 326 outputs, performs two-level decode logic in 55 ns.

Design for Testability 

The complexity of the chip presented some very difficult
testing challenges: fault coverage of a 450,000-transistor
circuit with nearly 300,000 nodes, characterization at clock
rates up to 24 MHz, verification of 350K bits of on-chip
firmware, and providing process feedback from the first
design in a new IC technology. Relying on commercially
available LSI testers to solve these problems was not feasible
because of their high cost, limited interactive diagnostic
capabilities, and performance limitations. To provide a fast
screen and detailed diagnostics under realistic operating
conditions at low cost, it was necessary to incorporate most
of the needed test capability into the chip's design.

Several key concepts are involved in the built-in testability
of the CPU chip. A structured design methodology and a
bus-oriented architecture allow substantial partitioning.
Since all of the inputs and outputs of the individual circuits
are connected to at least one of the major internal buses,
every circuit can be individually controlled and observed
by communicating with only a small number of data and
control buses. A structured design separates circuits into
distinct functional blocks, and a building-block approach
Instruction Set for a Single-Chip 32-Bit Processor

by James G. Fiasconaro



Fitting the entire CPU for a powerful 32-bit computer on a single
manufacturable IC was a formidable task by any standard. This
task was accomplished, in part, by encouraging the engineers
who were designing and implementing the hardware and the
instruction set to make the necessary tradeoffs between the two,
but always with a thought towards the performance of the
resulting chip. The present design is the result of many optimizing
iterations. The hardware contains thirty-nine 32-bit registers, a
32-bit shifter, a 32-bit ALU, and 9K 38-bit words of microcode
control store. It executes microcode at an 18-MHz rate.

The instruction set is stack-oriented. Each program has its own
execution stack for allocating local variables, passing parameters
to other procedures, saving the machine state on procedure calls,
and evaluating expressions. There are instructions for pushing
data onto the stack from memory, and for popping data from the
stack and storing it in memory. Arithmetic instructions operate on
the uppermost data words in the stack and leave their results on
the stack. Instructions that operate from a set of parameters get
these parameters from the top of the stack.
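
The stack-oriented style described above can be shown with
a toy model. This sketch is not the HP instruction set: the
method names, the dictionary used as memory, and the
wraparound to 32 bits are all assumptions chosen only to show
how operands are pushed, combined, and stored without naming
source or destination registers.

```python
MASK32 = 0xFFFFFFFF


class StackMachine:
    """Toy stack machine with 32-bit words (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # maps an address (any hashable) to a word
        self.stack = []

    def push(self, address):
        """Push the word at a memory address onto the stack."""
        self.stack.append(self.memory[address] & MASK32)

    def pop(self, address):
        """Pop the top of the stack and store it in memory."""
        self.memory[address] = self.stack.pop()

    def add(self):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append((left + right) & MASK32)

    def sub(self):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append((left - right) & MASK32)


# Evaluate c = a - b using only stack operations.
machine = StackMachine({"a": 7, "b": 5, "c": 0})
machine.push("a")
machine.push("b")
machine.sub()
machine.pop("c")
print(machine.memory["c"])    # 2
```

Note that the arithmetic steps carry no operand fields at all;
only the push and pop steps need an address.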

Segmentation is used to support virtual memory in the CPU
instruction set. Every program can use up to 4096 code segments
and 4096 data segments, and must use at least three
segments: a code segment, a stack segment, and a global data
segment (Fig. 1). Three pairs of 32-bit registers on the CPU point
to the start and end of each of these three segments. These are
the base and limit registers shown in Fig. 1. Another register, the
program counter, points to the current instruction in the code
segment, and two other registers point into the stack segment.
The Q register points to the most recently pushed stack marker
and the S register points to the uppermost 32-bit word in the stack.
Four other registers on the CPU are used as a cache memory for
the top four words in the stack, greatly reducing the number of
reads and writes necessary to maintain the stack in memory. The
information required to manage the segments used by each
program is maintained in memory-resident tables. Each program
has its own code and data segment tables and one common set of
system code and data segment tables is shared by all programs.

Each code segment table entry contains the location and length
of the segment, an absence bit, a privileged mode bit, a reference
bit, and a use count. The use count indicates how many CPUs in
the system are using the code segment at each point in time. Two
primary instructions using the code segment tables are PCL
(procedure call) and EXIT. PCL pushes a four-word stack marker,
which contains the index register, the status register, the offset to
the preceding stack marker, and the return address, onto the
stack and transfers control to the new procedure. EXIT does the
reverse and returns to the calling procedure. Both instructions
also do a considerable amount of error checking.
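
The stack marker mechanism can be sketched as follows. The
model is a loose illustration under several assumptions: the
order of the four words within the marker, the convention that
Q holds the index of the marker's first word, the use of
pc + 1 as the return address, and the omission of all segment
table lookups and error checking. None of these details is
given in the article.

```python
class Cpu:
    """Minimal model of PCL and EXIT (illustrative, no error checking)."""

    def __init__(self):
        self.stack = []      # one list element per 32-bit word
        self.q = 0           # index of the most recent stack marker
        self.pc = 0          # program counter
        self.index = 0       # index register
        self.status = 0      # status register

    def pcl(self, target):
        """Push a four-word stack marker and transfer to the procedure."""
        base = len(self.stack)
        self.stack += [self.index, self.status, base - self.q, self.pc + 1]
        self.q = base
        self.pc = target

    def exit(self):
        """Undo the most recent PCL and return to the caller."""
        index, status, delta, return_address = self.stack[self.q:self.q + 4]
        del self.stack[self.q:]      # drop the marker and the callee's words
        self.q -= delta
        self.index, self.status, self.pc = index, status, return_address


cpu = Cpu()
cpu.pc, cpu.index = 100, 7
cpu.stack.append(11)         # a local variable of the caller
cpu.pcl(500)                 # call a procedure
cpu.index = 9                # the callee changes the index register
cpu.exit()
print(cpu.pc, cpu.index, cpu.stack)    # 101 7 [11]
```

Because each marker records the offset to the one before it,
nested calls unwind correctly with only the single Q register.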

Each data segment table entry contains the location and length
of the segment, an absence bit, a privileged mode bit, a reference
bit, a dirty bit, a write enable bit, a paged bit, a page-size field, link
information, and a use count. Unlike code segments, data segments
can be paged (with the exception of a program's stack and
global data segments). Each program can access up to 4096
data segments through an external data segment pointer. If the
segment is not paged, this pointer is interpreted as a 12-bit
segment number and a 19-bit offset within the segment. (Segment
length can be up to 2^19 bytes.) If the segment is paged, this
pointer is interpreted as a 31-bit virtual address with a 12-bit
segment number, a page number, and an offset within the page.
The page size can be chosen by the operating system in powers
of two up to 2^* bytes. For paged segments, the data segmient 
table entry points to a page table that contains a two-word entry
containing location and status information for each page.
Unpaged segments can be linked together and treated as a single
logical entity, either by allocating the individual segments in
consecutive data segment table entries, or by letting each data
segment table entry point to the next entry in the chain.
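
For an unpaged segment, splitting an external data segment
pointer into its two fields is a matter of shifts and masks.
The sketch below assumes that the 12-bit segment number
occupies the high-order bits and the 19-bit offset the
low-order bits of the 31-bit value. The article gives the field
widths but not their positions, so this layout and the function
names are hypothetical.

```python
SEGMENT_BITS = 12
OFFSET_BITS = 19
OFFSET_MASK = (1 << OFFSET_BITS) - 1
SEGMENT_MASK = (1 << SEGMENT_BITS) - 1


def encode_unpaged(segment, offset):
    """Pack a segment number and byte offset into one pointer."""
    if not 0 <= segment <= SEGMENT_MASK or not 0 <= offset <= OFFSET_MASK:
        raise ValueError("field out of range")
    return (segment << OFFSET_BITS) | offset


def decode_unpaged(pointer):
    """Return (segment number, byte offset) from a pointer."""
    return (pointer >> OFFSET_BITS) & SEGMENT_MASK, pointer & OFFSET_MASK


pointer = encode_unpaged(5, 0x1234)
print(decode_unpaged(pointer))             # (5, 4660)
print(1 << SEGMENT_BITS, 1 << OFFSET_BITS) # 4096 524288
```

The two field widths account for the limits stated in the text:
4096 data segments per program and an unpaged segment
length of up to 2^19 bytes.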

Because the instruction set is stack-oriented, many instructions
(e.g., ADD, SUB, AND, and OR) operate on the uppermost words in
the stack and do not require any source or destination specification.
Instructions that push information onto the stack and pop
information from the stack use direct, direct indexed, indirect, or
indirect indexed addressing. Direct addressing uses a base register
and an offset specified in the instruction. Direct indexed
addressing is similar except that the index register (a 32-bit
two's-complement byte offset) is also added. Indirect addressing
starts with the direct addressing calculation and fetches the
indicated word from memory. This word is interpreted as a stack
segment pointer, a global data segment pointer, or an external
data segment pointer. Stack and global data segment pointers
are simply offsets from the stack base and data base registers.
External data segment pointers are evaluated through the data
segment tables as described earlier. Indirect indexed addressing
is like indirect addressing except that the index register is added
after the indirect pointer is evaluated.
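
The four addressing modes differ only in whether a pointer is
fetched and whether the index register is added. The sketch
below captures that ordering. The function signature, the
dictionary used as memory, and the resolve_pointer callback
(which stands in for the pointer interpretation described
above) are all hypothetical, and bounds checking is omitted.

```python
def effective_address(memory, base, offset, *, indirect=False,
                      index=None, resolve_pointer=None):
    """Compute an effective byte address.

    direct:            base + offset
    direct indexed:    base + offset + index
    indirect:          resolve(memory[base + offset])
    indirect indexed:  resolve(memory[base + offset]) + index
    """
    address = base + offset
    if indirect:
        # Fetch the pointer word, then let the caller-supplied function
        # turn it into an address (stack, global, or external segment).
        address = resolve_pointer(memory[address])
    if index is not None:
        address += index         # index is applied after any indirection
    return address


DATA_BASE = 1000
memory = {108: 40}               # word at address 108 holds a pointer


def as_global_pointer(pointer):
    """Treat the pointer as an offset from the data base register."""
    return DATA_BASE + pointer


print(effective_address(memory, 100, 8))                        # 108
print(effective_address(memory, 100, 8, index=-4))              # 104
print(effective_address(memory, 100, 8, indirect=True,
                        resolve_pointer=as_global_pointer))     # 1040
print(effective_address(memory, 100, 8, indirect=True, index=12,
                        resolve_pointer=as_global_pointer))     # 1052
```

Adding the index last is what lets one indirect pointer serve
as the base of an array that the index register then steps
through.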

The instruction set provides a full repertoire of load and store
instructions for bit, byte, half-word, word (four bytes), and
double-word quantities using the addressing modes just described.
All memory accesses using these instructions are
bounds checked against program base and program limit, stack
base and S register, data base and data limit, or the location and
length information in a table, as appropriate. A bounds violation
causes a trap to the operating system. Stores into code segments
are not allowed. In unprivileged mode, a user can access only the
user's own code, stack, global data, and external data segments.
The instruction set also provides a set of privileged load and store
instructions which use absolute addresses instead of segment
base and offset information to access memory.

The primary data types supported by the instruction set are
integers, floating-point numbers, and byte strings. Integers can
be either 16-bit or 32-bit two's-complement numbers, 32-bit
unsigned integers, or eight-digit unsigned decimal integers. The
basic operations for add, subtract, multiply, divide, negate,
compare, shift, and rotate are provided along with provisions to
facilitate multiprecision (i.e., greater than 32-bit) integer arithmetic.
Instructions that use one integer from the stack and an 8-bit
immediate operand in the instruction are also provided.

Two types of floating-point numbers are supported. The first
type includes 32-bit and 64-bit IEEE-standard binary
floating-point numbers. The standard is met by supporting
performance-critical operations directly in microcode and all other
operations either directly by the operating system or by traps from
microcode to the operating system. The second type is a 17-digit
decimal format; only conversions between this format and the 64-bit
IEEE-standard format are supported.

Both structured and unstructured byte string operations are
supported. Unstructured strings are simply byte arrays. A set of
move, scan, and translate instructions is provided to support
this data type. Structured byte strings correspond to the string
data types found in most high-level programming languages.
These strings are accessed through a four-word string descriptor
containing a pointer to the string, its maximum length, an index to
the first byte of interest, and the number of bytes of interest. The
current length of the string is stored in the first four bytes of the
byte array containing the string. Instructions to load, concatenate,
validate, and assign structured byte strings are supported.
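
A structured string and its descriptor might be modeled as follows.
This is a sketch under stated assumptions: the field names, the
big-endian length field, and the convention that the index counts
from the first character after the four-byte length are all
illustrative choices, not details taken from the instruction set
definition.

```python
from dataclasses import dataclass


@dataclass
class StringDescriptor:
    """Four-word descriptor; field names are hypothetical."""
    pointer: int     # address of the byte array holding the string
    max_length: int  # maximum length of the string, in bytes
    first: int       # index of the first byte of interest
    count: int       # number of bytes of interest


def current_length(array: bytes) -> int:
    # The current length occupies the first four bytes of the array
    # (big-endian byte order is an assumption of this sketch).
    return int.from_bytes(array[:4], "big")


def bytes_of_interest(desc: StringDescriptor, memory: dict) -> bytes:
    # memory maps an address to the byte array stored there (toy model).
    array = memory[desc.pointer]
    body = array[4:4 + current_length(array)]
    return body[desc.first:desc.first + desc.count]
```

Keeping the current length with the data and the window of interest in
the descriptor lets several descriptors view different substrings of
one string.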

The instruction set interacts with the operating system in two
primary ways. The first way is through traps to code supplied by
the operating system. When the microcode encounters a situation
that it cannot handle, it traps to a prearranged entry point in a
prearranged code segment. There are two broad categories of
traps. The first category consists of error conditions. Examples
include segment bounds or table length violations, privileged
mode violations (attempts by unprivileged programs to execute
privileged instructions or access privileged information), integer
divide by zero, and system errors. The second category consists
of situations that require operating system intervention. Examples
include absent segments, pages, and page tables, stack
overflow, floating-point mathematics traps, attempts to execute
unimplemented instructions, and traps to support a set of high-level
language debugging aids.

The second way the microcode interacts with the operating
system is through a set of instructions. These instructions are
primarily involved with task dispatching and I/O. This approach
supports getting to and from the dispatcher and I/O driver code,
assists some of the low-power operations which the dispatcher
and I/O drivers must perform, and provides a special stack for the
dispatcher and I/O drivers. The details of the algorithms used in
the dispatcher and I/O drivers were left for the operating system to
implement in machine code. This approach provides a good
tradeoff between speed and flexibility.

The I/O interrupt handler provides sixteen I/O interrupt levels. At
each level, I/O interrupts are handled on a first-come-first-serve
basis. This is accomplished in cooperation with the I/O processor
(IOP) chip by maintaining a linked list of all of the devices waiting
for service at each priority level. The IOP logs devices at the end of
each list and the CPU removes devices from the head of each list.
Finally, provisions are made so that any CPU in a multiple-CPU
system can handle any I/O interrupt.
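
The per-level queues can be pictured with the Python sketch below. It
is only a model: the real lists are linked lists in memory shared by
the IOP and the CPUs, whereas this sketch uses collections.deque, and
it assumes that a larger level number means higher priority. The class
and method names are hypothetical.

```python
from collections import deque

NUM_LEVELS = 16  # sixteen I/O interrupt levels


class InterruptLists:
    """One first-come-first-served queue per interrupt level."""

    def __init__(self):
        self.levels = [deque() for _ in range(NUM_LEVELS)]

    def iop_record(self, level, device):
        # The IOP adds a waiting device at the end of the list.
        self.levels[level].append(device)

    def cpu_next(self):
        # The CPU removes a device from the head of the highest
        # non-empty list (assumption: higher number = higher level).
        for level in range(NUM_LEVELS - 1, -1, -1):
            if self.levels[level]:
                return level, self.levels[level].popleft()
        return None
```

Because producer and consumer work on opposite ends of each list,
ordering within a level is preserved without any sorting.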

Table I lists typical instruction times for a few CPU instructions.
However, these times do not tell the whole story because up to
three CPUs can be included in each Memory/Processor Module.
Support for multiple-CPU systems was built into the instruction set
from the very beginning. This support occurs primarily in the areas
of dedicated memory locations, interrupt handling, and
manipulation of the code and data segment tables in memory. This
support guarantees exclusive access to system information when
necessary and facilitates implementation of efficient memory
management in the operating system.



Table I
Typical Instruction Times

Instruction                               Time (μs)

Direct Load                                  0.56
Integer Add                                  0.28
Integer Multiply                             2.9
Integer Divide                               9.4
64-Bit Floating-Point Add                    6.0
64-Bit Floating-Point Multiply              10.4
64-Bit Floating-Point Divide                16.0
Procedure Call (to same segment)             3.3
Procedure Call (to different segment)        7.8



limits the number of different blocks. 

To use these architectural features for testing purposes, a
small amount of diagnostic support circuitry was added to
the chip. The microinstruction register, one data register
connected to an internal data bus, and an internal opcode
bus were modified to allow loading or dumping serially
through a single bus line. These registers can directly or
indirectly control all of the internal data, address, and
control signals on the chip. Modifications to the microsequencing
state machine provide the ability to halt or single-step
microcode execution in a manner transparent to the
microprogram being executed. This is done by using latches on
all test qualifiers and recirculating data on internal buses.

A diagnostic interface port was added to facilitate control
of the internal test features. This port consists of seven of the
CPU chip's wire-bond pads: four opcode-bit pads, a serial
I/O pad, an output pad, and a synchronizer input pad. The
four opcode bits are connected to PLA inputs and used to
control the serial shift registers and alter normal
microinstruction sequencing. Data is loaded into and dumped from
the internal registers via the serial I/O pad. The output of the
test multiplexer can be observed via the output pad. The
synchronizer input pad allows asynchronous communication
between the diagnostic port and an external tester. The
opcode bits are only executed once each time the
synchronizer input is pulsed, which enables a relatively slow
computer to communicate with and control a CPU running
at a much higher clock frequency.

These features form an extremely powerful set of
diagnostic tools. Operations that can be controlled through the
diagnostic port include setting microcode breakpoints,
single-stepping microcode, loading and examining
internal registers, and executing externally supplied microcode.
The chip partitioning allows testing and characterization of
a single circuit regardless of whether other circuits on the
chip are functional. This capability proved to be essential in
verifying a design of this complexity.

Testing chips in a production environment requires a
high-speed pass/fail screen. To do this, a 100K-bit self-test
microprogram was encoded into the CPU's ROM. This
microprogram executes in twenty million clock cycles and
outputs a series of pulses through the diagnostic port to
indicate functionality of each major section of the chip. In
addition to the standard instruction set, the self-test uses a
set of microinstructions designed specifically for testing.
Greater than 95% coverage of "stuck-at" faults is
achieved, and a variety of other potential defects such as
storage node leakage, pattern sensitivities, and timing
problems are covered as well. The self-test microprogram is
executed whenever the chip is powered up, so it can be used
for system verification and field tests besides wafer tests.

A feature of the architecture allows the CPU to
communicate with itself via independent pad drivers and receivers
connected to each of the MPB interface pads. Functional
pad testing can thus be accomplished without the need for
external, expensive, high-speed test electronics. However,
if required, various loads can be connected to the circuit's
pads during testing to simulate a system environment.

System-level hardware and software verification are also
addressed by the built-in test features. A flip-flop controlled
through the diagnostic port can put the CPU in a mode
where it enters a transparent idle state at the completion of
each machine instruction. This allows instructions to be
single-stepped. Special microcode routines to provide
breakpoint, variable tracing, and other software verification
features are programmed into the CPU's ROM. Low-level
system debugging can be done by executing
microinstructions supplied through the diagnostic port to drive and
monitor the system bus.



VLSI I/O Processor for a 32-Bit Computer
System

by Fred J. Gross, William S. Jaffe, and Donald R. Weiss



HP'S 32-BIT VLSI computer system requires a
high-performance input/output data path. The design
objectives for the I/O path were to provide high data
rates to peripheral devices to match the high performance of
the CPU and to minimize the design effort. An I/O processor
(IOP) able to control most I/O transactions without
interfering with the CPU was chosen because it met the
performance objective, and by using the same circuits and basic
structure as the CPU chip, also met the second objective. As
a side benefit, the first production runs of each chip served
to test the other chip's design and establish a common
reliability record for the shared circuits.

The I/O processor has an I/O bus bandwidth of 5.1M
bytes/s when transferring at maximum rate. The IOP is
capable of addressing eight device adapters, also known as
I/O cards. Each device adapter has its own DMA (direct
memory access) resource. There are sixteen levels of
interrupt assignable to device adapters. The IOP is also capable
of independently executing simple channel programs.



A microcode-controlled state machine gives the I/O
processor enough power to perform all of its I/O tasks. A 38-bit
microcode word with eight subfields allows simultaneous
control of the I/O processor's internal registers and external
control lines.

Operation 

Operation of the I/O processor is directed by the CPUs in
the computer system. The IOP alternately checks for a
command from any CPU in the system or for a valid service
request from any enabled device adapter. A CPU
communicates with an IOP by sending it a command word and a data
word. Embedded in the command word is the requesting
CPU's return address. This allows all CPUs in a system to
use any IOP. Commands sent from a CPU can set up DMAs,
read registers on the IOP, or do direct I/O with a device
adapter. Complex tasks that an IOP cannot handle
independently result in CPU interrupts.

The IOP is connected to other processors and memory via
the memory processor bus (MPB). The MPB interface is a
32-bit pipelined interface with a synchronous protocol that
allows overlapped memory fetches. It has its own registers
and control logic, which hide its complex protocol from the
IOP's register stack and control logic. This improves
performance by allowing internal operation in parallel with
memory operations. The MPB interface on the IOP chip is
identical to the one on the CPU chip.

The IOP's I/O bus connects it to the device adapters. The
new I/O bus protocol for the IOP is called HP-CIO for
Hewlett-Packard Channel Input/Output. The protocol was
defined during the development of the IOP to provide a
processor-independent, message-oriented bus.

During an IOP poll cycle, a device adapter enabled for
service requests asserts a data line corresponding to its
assigned address. The IOP latches the I/O bus responses,
masks out any disabled devices, priority encodes the
results, and then services the highest numerical address.
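
The latch, mask, and priority-encode steps of a poll cycle reduce to a
few bit operations. The Python sketch below is illustrative: it
assumes that data line n corresponds to device adapter address n and
that the enable register is a simple bit mask; the function name is
hypothetical.

```python
def poll_winner(responses: int, enabled_mask: int):
    """Return the address of the device adapter to service, or None.

    responses    -- latched bus responses, bit n set when adapter n asks
    enabled_mask -- bit n set when adapter n is enabled for service
    """
    pending = responses & enabled_mask & 0xFF  # eight device adapters
    if pending == 0:
        return None
    # Priority encode: the highest numerical address wins.
    return pending.bit_length() - 1
```

For example, if adapters 1, 2, and 5 respond and all are enabled,
adapter 5 is serviced; if only adapters 1 and 2 are enabled, adapter 2
is serviced.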

Service consists of transferring bytes or half-words (two
bytes) which can be either data or commands. During the
transfer, the address lines select one device adapter and the
data direction line indicates who will be the data sender.
The end of a transfer is signaled by the trailing edge of the
I/O strobe IOSB, which the device adapter uses as a clock
when receiving data or as a signal to assert the next data
when sending data.

A poll cycle on which a single data transfer occurs is
called a multiplex cycle and a cycle on which multiple data
transfers occur is called a burst cycle (Fig. 1). A burst cycle
increases I/O bus bandwidth because more bytes are
transferred per poll cycle. The bandwidth reaches a maximum of
5.1M bytes/s in burst mode and 973K bytes/s in multiplex
mode. When a poll is won by a device adapter for DMA, it
has the option of asserting burst request BR. A device
adapter in burst mode can take any number of transfers between
two and thirty-two by asserting and then unasserting BR at
the appropriate times. To reduce the lock-out time to an
acceptable level, the IOP limits the number of transfers per
poll cycle to no more than thirty-two.

Fig. 1. Timing diagram for (a) multiplex and (b) burst I/O
cycles in the I/O processor.

The width of the data word (byte or half-word) on the I/O
bus is determined by the data sender. If the IOP is
transferring in byte mode, channel byte CB is asserted to indicate
to the device adapter that only the least-significant eight
bits of the data bus are valid. If the device adapter is
transferring in byte mode, device byte DB is asserted.

CPU interrupts are usually the result of either a DMA
termination or a device adapter service request that cannot
be handled by the IOP. When an interrupt occurs, the IOP
records a device adapter interrupt request at the end of a
linked list in memory for the interrupt level it is on. (This
level is assigned by the CPU and stored in the status register
for a particular device.) A message is then sent to the CPU to
indicate that an interrupt was recorded for that particular
interrupt level. When the CPU completes its current
instruction, it services the highest-level list, starting at the
list's beginning.

The CPU can configure itself to accept all interrupts, all
interrupts above a certain level, or no interrupts. Each IOP
has a register for enabling interrupts for any or all device
adapters. A register on the IOP determines whether a
particular CPU gets the interrupt request, or if all CPUs in the
system get the interrupt request. In the latter case, the first
eligible CPU available services the interrupt.

CPU commands not requiring a response can be placed in
a list in memory for the IOP to execute. These lists are called
channel programs. Each entry consists of a command word
and a data word. The fourth word of the device reference
table contains a pointer to the next executable command in
the channel program. Each device adapter for every IOP has
its own unique table in memory. When a status bit is
enabled for a particular device adapter, the IOP executes one
command per poll cycle when there are no CPU commands
or service requests. A typical channel program allows
multiple data transfers from different memory addresses to take
place without interrupting the CPU. The logical completion
of a channel program usually results in an interrupt.

I/O Processor Design 

The IOP consists of a microcoded control section
(implemented with an internal ROM, an address sequencer, and a
PLA decoder), a register stack of 44 registers connected by a
common bus, the MPB interface, and an I/O interface. A
block diagram of these sections on the IOP chip is shown in
Fig. 2.

The control store is a 4608-by-38-bit, series-FET ROM
with two-state pipeline access. In the first state of the
pipeline, a page address is issued to select one 32-word
page of the 144 possible pages. In the second state, the word
address selects one of the 32 words on the selected page to
be transferred to the PLA via a 38-bit bus. Branches within
the current page do not interrupt the pipeline timing
because the new word address is selected in the first state.
Only jumps off the current page cause the pipeline to be
restarted. The IOP only needs a ROM one-half the size of the
CPU ROM. Structuring the CPU ROM into two equal arrays
simplified the conversion to the IOP design.
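
The page and word organization of the control store can be checked
with a little arithmetic. In the Python sketch below, the mapping of
the page number to the high-order address bits and the word number to
the low-order five bits is an assumption made for illustration, and
the function names are hypothetical.

```python
WORDS_PER_PAGE = 32
PAGES = 144
ROM_WORDS = PAGES * WORDS_PER_PAGE  # 4608 words of 38 bits each


def split_address(addr):
    """Split a ROM address into (page, word-within-page)."""
    if not 0 <= addr < ROM_WORDS:
        raise ValueError("address outside the control store")
    return divmod(addr, WORDS_PER_PAGE)


def leaves_current_page(current_addr, target_addr):
    """True when a branch target lies on a different page, the case
    in which the two-state pipeline has to be restarted."""
    return split_address(current_addr)[0] != split_address(target_addr)[0]
```

A branch from address 32 to address 63 stays on page 1 and keeps the
pipeline running, while a transfer from address 31 to address 32
changes pages.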

The ROM address sequencer computes the 13-bit address
of the next location to be fetched from ROM. In normal
sequential access the previous address is incremented, but
nonsequential addresses can be selected from either the
previous instruction's branch target, the top of a subroutine
stack, or a trap ROM. The address sequencer circuits are the
same as those used for the CPU, but to conserve space, the
opcode mapper circuit is deleted.

The PLA decodes the microcode words from the control
store ROM and generates over 230 signals to control the
IOP. The PLA is implemented with dynamic NOR-NOR logic
for high performance, high density, and low power
consumption. The IOP PLA is specially programmed for the
IOP architecture, but the low-level building-block circuits
for its design are identical to those used in the CPU's PLA.

The test-condition multiplexer controls conditional
branching in the ROM address sequencer. It consists of
static latches to sample and hold status information or
external qualifiers, and a series-FET multiplexer to select the
proper qualifier for the conditional branch.

The eight subfields in the 38-bit microcode word are
classified as test, special, bus drive, bus receive, and four I/O
controls. This word structure allows handling the unusual
task of simultaneously directing data flow internal to the
IOP while providing the appropriate I/O timing signals. In
many cases I/O timing is adjusted by merely adding or
deleting NOP (no operation) words to the microcode.

The register stack is made up of registers from 4 to 32 bits
in length and a logic unit. The registers are divided into an
active set that contains information about the DMA
currently in progress, and a storage set that holds DMA
information for all eight device adapters when DMA is not
active. The active DMA registers consist of a memory address
register with an incrementer, a count register with a
decrementer, a burst count register with a decrementer, a status
register with bits testable by the test-condition multiplexer,
and an I/O data transfer register. When a DMA becomes active,
the memory address, count, and status register values are
transferred from the storage set to the active set. Each device
adapter has a data register on the IOP chip to eliminate the
need for a memory access before a data transfer. Response
time to a DMA request is greatly improved since all
information is contained on the IOP. When the transfer or
transfers are completed, the new register values are stored in the
storage set and the data buffer is filled or emptied.

The logic unit on the IOP replaces the powerful ALU
found on the CPU. The bit-set/clear function performs a
logical AND, OR, or exclusive-NOR between the IOD and
GPO registers and places the results on the common bus.
The constant function can set any of the sixteen
least-significant bits on the common bus. The compare function
compares 32, 16, or 8 bits of the IOD and GPO registers and
sends the results to the test-condition multiplexer.
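
The logic unit functions are simple enough to model in a few lines of
Python. The sketch assumes that the compare function tests equality of
the least-significant 32, 16, or 8 bits; which bits are compared and
what result is reported are not stated above, so treat those details,
and the function names, as hypothetical.

```python
MASK32 = 0xFFFFFFFF


def bit_set_clear(op, iod, gpo):
    """AND, OR, or exclusive-NOR of two 32-bit register values."""
    if op == "and":
        result = iod & gpo
    elif op == "or":
        result = iod | gpo
    elif op == "xnor":
        result = ~(iod ^ gpo)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result & MASK32  # keep the result to 32 bits


def compare(iod, gpo, width):
    """Compare the low `width` bits (assumed equality test)."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16, or 32")
    mask = (1 << width) - 1
    return (iod & mask) == (gpo & mask)
```

ANDing with a mask clears bits, ORing sets them, and the exclusive-NOR
yields a 1 wherever the two registers agree.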

The I/O bus hardware consists of data drivers and
receivers, address register and drivers, control drivers, and input
qualifiers. The data drivers buffer the IOD register contents
for output and are in a high-impedance third state during
input operation while data is latched in the IOD register.
The pad driver has a push-pull output stage designed to
drive a load of 15 to 20 pF. The lines are buffered by
external high-speed bipolar devices. This design has the
advantages of being able to drive a large bus capacitance
quickly without requiring a large IC chip area and of
isolating the MOS lines from damage caused by static discharge.
Input receivers consist of a protection device and a
regenerative latch. The latch ensures proper system operation
by resolving the input level before the internal
test-condition multiplexer tests it.

The diagnostic port is a seven-line serial interface used
for testing and diagnostics. It is identical to the one used on
the CPU and allows the CPU and IOP to be tested on the
same custom tester.

Self-Test 

When power is applied, the IOP turns on its self-test
indicator and performs a self-test of its hardware using
microcode routines programmed in its ROM. The routines
are classified as internal test, I/O interface test,
channel-to-channel test, and memory test. The internal test
functionally tests all registers, operations, and sequencing. The I/O
interface test sequentially sets and clears all output control
lines and tests their level by a separate input. All inputs are
driven high and low by special outputs and tested to ensure
that they are functional. The channel-to-channel test causes
the IOP to send data to itself via the MPB. Finally, if the CPU
sends a message to the IOP indicating that there is working
memory in the system, the IOP tests its ability to write to
and read from memory. After the successful completion of
all tests, the IOP turns its self-test indicator off.



High-Performance VLSI Memory System 

by Clifford G. Lob, Mark J. Reed, Joseph P. Fucetola, and Mark A. Ludwig



IMPLEMENTING A HIGH-PERFORMANCE memory for
HP's new 32-bit VLSI computer system requires the
achievement of several important design goals to realize
the full potential of this VLSI architecture. A dense resident
memory and a large virtual address capability are desirable.
A large memory bandwidth is needed to support multiple
CPUs and I/O processors (IOPs) without creating
bottlenecks. Also needed is the ability to do flexible
memory operations such as byte, half-word, word,
semaphore transfer, and refresh functions that are
transparent to the CPU, IOP, and operating systems.



Fig. 1 shows a block diagram of a memory card for the
32-bit VLSI computer system. The key elements are the
memory processor bus (MPB), MPB interface, memory
controller chip, 128K-bit dynamic RAM chips, and clock chip.

Each memory card has twenty RAM chips organized in
four rows of five chips each. Each RAM chip supplies 128K
bits of memory storage and the memory card provides 256K
bytes of total storage. Thus, a Memory/Processor Module
can contain up to 2.5 megabytes of memory if it uses only
one CPU and one IOP, and memory cards are inserted in all
of the module's ten remaining empty slots.

To achieve a large virtual address space, the 32-bit
address has three bits of format control, allowing the
remaining 29 bits to be used for addressing up to 2^29 bytes. In
addition, virtual memory support is provided by the CPU's
microcode and instruction set.
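
The capacity figures follow from the chip organization, as the Python
arithmetic below illustrates. The 40-bit row width (39 bits used, one
unused) comes from the description of the RAM rows in this article;
the constant and function names are chosen here for illustration.

```python
K = 1024

RAM_BITS = 128 * K        # bits per RAM chip
CHIPS_PER_ROW = 5
ROWS_PER_CARD = 4
ROW_WIDTH_BITS = 40       # 32 data + 7 check bits, 40th bit unused
DATA_BYTES_PER_WORD = 4   # only the 32 data bits count as storage
MEMORY_CARDS = 10         # slots left after one CPU and one IOP


def card_capacity_bytes():
    words_per_row = CHIPS_PER_ROW * RAM_BITS // ROW_WIDTH_BITS  # 16K words
    return ROWS_PER_CARD * words_per_row * DATA_BYTES_PER_WORD


def module_capacity_bytes(cards=MEMORY_CARDS):
    return cards * card_capacity_bytes()


# 29 address bits remain after three bits of format control.
VIRTUAL_BYTES = 2 ** 29
```

Each card therefore holds 256K bytes of data, ten cards give 2.5
megabytes, and the 29-bit address reaches 512 megabytes.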

The memory controller chip and the RAMs communicate
via an 8-bit memory-address bus, the 39-bit memory-data
bus, and a chip select (CS) line. Each row of five RAM chips
has its own CS line, and the memory controller chip is
connected to each row. Except when doing a refresh, the
chip asserts only one CS line at a time. The memory address
bus (MAB) and CS lines are driven only by the memory
controller chip. The memory data bus (MDB) is
bidirectional: write data is driven by the memory controller chip
and read data is driven by the RAMs.

A large memory bandwidth is achieved through the MPB
interface protocol, the pipelined nature of the RAM chip,
and 18-MHz operation. Fig. 2 shows the timing for three
read cycles. The internal pipeline design of the RAM allows
it to accept a second address before handling data for the
first address, and to issue read data nine million times per
second. This allows the processors to issue three
nonsequential data address requests without waiting for the first
data word. Multiple processors, through a priority polling
scheme, can interleave data.

After the polling sequence is completed, a memory ad- 
dress is sent on the bus and a read operation is indicated. 
The memory controller issues an 8-bit X address and a 6-bit 
Y address in succession on the memory address bus and 
generates the appropriate chip select CS. The RAM then 
decodes the address and outputs data onto the memory data 
bus. The memory controller corrects and aligns the 39-bit 
data word from the RAM row and outputs a 32-bit data word 
to the memory processor bus. 

A single CPU can use no more than 65% of the bandwidth 
of the memory system. During normal operation, a CPU 
uses 30% of the bandwidth. In a system with multiple CPUs 
or CPUs with lOPs, the full bandwidth of 36M bytes/s can 
be completely used. 
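
The 36M-byte/s figure is consistent with one 32-bit word per bus cycle
at 18 MHz, with one bus cycle equal to two states. A Python sketch of
the arithmetic follows; the function names are illustrative only.

```python
CLOCK_HZ = 18_000_000
STATES_PER_BUS_CYCLE = 2
BYTES_PER_WORD = 4


def peak_bandwidth():
    """Bytes per second when every bus cycle carries one word."""
    return CLOCK_HZ // STATES_PER_BUS_CYCLE * BYTES_PER_WORD


def share(fraction):
    """Bytes per second used by a processor taking this fraction."""
    return round(peak_bandwidth() * fraction)
```

On these numbers a single CPU tops out near 23.4M bytes/s (65%) and
normally uses about 10.8M bytes/s (30%), which leaves room for
additional CPUs and IOPs.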

Important to packaging of the memory system is the
finstrate board onto which the memory controller, RAM,
and clock chips are mounted. Using forced-air cooling, the
junction temperatures of the RAMs on an active memory
card will not exceed 90°C, even under the following worst
case conditions: 55°C ambient, 15,000 ft altitude, low fan
voltage, and a fully loaded Memory/Processor Module.
These low junction temperatures contribute to the excellent
reliability of this memory system.



Flexible memory operations and high reliability and
availability are implemented in the memory controller
chip. This chip is controlled by a PLA (programmable logic
array) for speed. It contains three separate synchronous
state machines that control self-test, "healing," and normal
memory controller operations. The chip dissipates up to
five watts and has a total of 119 wire-bond pads.

In addition to refreshing the RAM chips, the memory
controller performs the following functions:
■ Aligning (reading and writing) of bytes and half-words
■ Implementing semaphores by using the RAM capability
  of reading and writing in the same cycle
■ Mapping logical addresses to physical memory
■ Correcting single-bit errors and detecting double-bit
  errors on the fly
■ Healing bad memory locations by replacing them with
  other on-chip memory locations
■ Testing itself and the RAM chips.

Memory Controller Chip

Fig. 3 shows a detailed block diagram of the memory
controller chip. The MPB interface handles the MPB
protocol and routes addresses and data into and out of the chip.
The mapper contains 32 CAMs (content addressable
memories) and issues chip selects and part of the Y address.
The MAB/CS drivers and multiplexer handle time
multiplexing of X (row) and Y (column) addresses and read and
write chip select signals. The MDB drivers and multiplexer
handle time multiplexing of read data from, and write data
to, the RAMs.

The healer block also contains 32 CAMs. When an error is
detected in memory, the healer places the physical address
of that location in one of its CAMs so that substitution will
be made for all subsequent accesses to that address.
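
The effect of healing can be sketched in Python as a small table of
substitute locations consulted on every access. This is a behavioral
model only: the dictionary stands in for the 32 CAMs and their
replacement storage, RAM is modeled as a dictionary, and all names are
hypothetical.

```python
class Healer:
    CAM_ENTRIES = 32  # the healer block contains 32 CAMs

    def __init__(self):
        self.spares = {}  # healed physical address -> substitute word

    def heal(self, address, value=0):
        """Redirect a bad location; False when no CAM entry is free."""
        if address in self.spares:
            return True
        if len(self.spares) >= self.CAM_ENTRIES:
            return False
        self.spares[address] = value
        return True

    def read(self, address, ram):
        if address in self.spares:   # CAM hit: use the substitute
            return self.spares[address]
        return ram[address]

    def write(self, address, value, ram):
        if address in self.spares:
            self.spares[address] = value
        else:
            ram[address] = value
```

Once an address has been healed, reads and writes to it never reach
the faulty RAM location, which is why the substitution is invisible to
the processors.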

The data manipulation section contains a Hamming
encoder which attaches seven check bits to the 32-bit write
data, a Hamming decoder which detects the position of a
single-bit error and the existence of a double-bit error, a data
corrector which corrects the single-bit error, and byte
aligners which extract bytes and half-words from memory in
nonword read operations and place bytes and half-words
into memory for nonword write operations.
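
A 39-bit code with these properties can be illustrated with a textbook
extended Hamming code: six position-parity check bits plus one overall
parity bit. The Python sketch below uses that textbook layout as an
assumption; the actual check-bit assignment of the memory controller's
modified Hamming code is not given here, and the function names are
hypothetical.

```python
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32)  # six Hamming check bits
DATA_POSITIONS = [p for p in range(1, 39) if p not in CHECK_POSITIONS]
# Position 0 holds the seventh check bit: overall parity.


def _parity(word, check):
    # Parity of all positions whose index has the `check` bit set.
    return sum(word[p] for p in range(1, 39) if p & check) % 2


def encode(data):
    """Return a 39-element bit list for a 32-bit data value."""
    word = [0] * 39
    for i, p in enumerate(DATA_POSITIONS):
        word[p] = (data >> i) & 1
    for c in CHECK_POSITIONS:
        word[c] = _parity(word, c)
    word[0] = sum(word) % 2
    return word


def decode(word):
    """Return (data, status): 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = sum(c for c in CHECK_POSITIONS if _parity(word, c))
    odd_overall = sum(word) % 2 == 1
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome < 39:
        word[syndrome] ^= 1       # syndrome is the failing bit position
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error
    data = sum(word[p] << i for i, p in enumerate(DATA_POSITIONS))
    return data, status
```

A single flipped bit makes the overall parity odd and the syndrome
names its position; two flipped bits leave the overall parity even with
a nonzero syndrome, which is how a double-bit error is recognized
without being miscorrected.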

The MPB protocol is based upon polling for address/data
bus cycles, and a master-slave synchronous handshake.
During power-on, each CPU and IOP is assigned a nonzero
channel number based on its physical position in the
Memory/Processor Module. This channel assignment can
be altered by the operating system to give highest priority to
the processors requiring the most bandwidth for their tasks.
The protocol allows for eight priority-assigned channels
with 0 the highest priority and 7 the lowest priority.

Each memory controller is hardwired to channel 0, and is
given a unique number (MC#) by the power-on procedure.

Figures 4 and 5 show protocol timing for read and write
operations. The highest-priority channel responding to a
poll wins the bus cycle, asserts the address on the next
state,* asserts MCTL (master control) indicating a valid
address, and, if this is a write operation, asserts WDBE (write/
double-bit error). Two states later, the addressed slave
asserts SCTL (slave control) to signify that it recognizes
the address and currently is not busy. Three states after that,
the data is asserted on the bus, by the slave if the
transaction is a read and by the master if it is a write. The slave
asserts SCTL to signify that it can complete the transaction
and the master asserts MCTL to signify that it can complete
the transaction.

As a processor on the MPB, the memory controller chip
has many characteristics very different from the CPU and
IOP chips. Its master functions are simply to broadcast a
message to the system, and to grab bus cycles for refreshing
memory and for write operations. These chips are resident
on channel 0 to guarantee that they win the bus poll cycle
for these operations. In designing the chip, it was
considered important that any master or slave processor functions
interleave cleanly with pipelined memory accesses.

*In this article, one bus cycle is equal to two states.

Each row of RAM chips on a memory finstrate provides
16K words of 39 bits each. Each word consists of 32 data bits
and seven check bits which make up a modified Hamming
code to allow single-bit and double-bit error detection and
single-bit error correction. The 40th bit is not used.

A read address asserted on the MPB causes data to be
returned five states later (Fig. 2). This includes time needed
by the chip to perform its mapping functions, error detection,
and data alignment. RAM access time is three states.
However, a new access can be initiated every two states to give a
110-ns cycle time. The RAM is pipelined so that a second
access can be started while another is still in progress.
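
These state counts convert to time as follows, assuming that one state
lasts one period of the 18-MHz clock (an assumption that agrees with
the quoted cycle time when rounded). The function name is
illustrative.

```python
CLOCK_HZ = 18e6
STATE_NS = 1e9 / CLOCK_HZ  # about 55.6 ns per state


def states_to_ns(states):
    """Duration of a number of clock states, in nanoseconds."""
    return states * STATE_NS
```

Two states come to roughly 111 ns, matching the nominal 110-ns cycle
time, while the three-state RAM access and the five-state read latency
come to about 167 ns and 278 ns respectively.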

Read Memory Operation 
Fig. 4 shows timing for memory read operations. When
an address is placed on the MPB, it is automatically placed
on the A/D (address/data) bus during φ1D (φ1 data). Parts
of the address on the A/D bus are processed simultaneously
by many parts of the memory controller chip. An internal
register access decode section checks to see if the channel
field (bits 3 to 5) equals 0. It also captures address bits 6 to 10
and 23 to 29, which are pertinent to determining which
memory controller chip register is accessed. The CAMs in
the mapper compare bits 3 to 17 to their contents.
Meanwhile, bits 18 to 29 go to the MAB/CS section and the healer,
and bits 0, 1, 2, 30, and 31 go to the control PLA. In the
MAB/CS section, bits 22 to 29 go out immediately to the



18-MHz Clock Distribution System 



by Clifford G. Lob and Alexander O. Elkins



Designing the high-frequency distribution system to allow HP's
new 32-bit VLSI processor to operate at 18 MHz proved to be a
significant design challenge. The chips required 6V, two-phase
nonoverlapping clocks with rise times less than 6 ns and
overshoot/undershoot less than 1V. It was decided early in the
project that, because of area constraints, the processor chips
would not buffer their clocks. However, the RAM chips do provide
some buffering. Hence the capacitive loading components vary
from approximately 300 pF per phase for a CPU chip to approximately
30 pF per phase for a RAM chip. In addition, the capacitive
loading presented is highly variable because of the dynamic
circuits used and depends on which circuits are active. Worst-case
tolerances produce capacitive specifications that can vary
±30% and cause unbalanced loads on each phase.

The first step in the design of the clock distribution system was
the clock buffer chip. The clock buffer chip divides a 36-MHz
signal and produces the two-phase, nonoverlapping clocks φ1
and φ2. Large capacitive drive is required since the RAM finstrate
can load the clocks with 1500 pF per phase. In addition, the
clocks are required to use a system sync signal to ensure that φ1
occurs on all finstrates simultaneously.

The chip size is 3.S4 by 3.65 mm. Each large output transistor
on the chip has a channel approximately 55,000 μm wide by 2.1
μm long and an output impedance of 0.5 ohm.

Fig. 1 shows a clock chip bonded to a finstrate and surrounded
by chip capacitors used to reduce inductance and to bypass the
supplies and ground. Peak currents of 2 to 3A occur when the
clock switches. Multiple bonds interleaved with power supply and
ground signals and multilayer chip metallization are used to reduce
inductive and resistive effects.
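The peak currents quoted above are consistent with a rough estimate of the current needed to slew the clock loads. The sketch below applies I = C dV/dt to the load figures given in this box, assuming a linear 6V ramp in the full 6 ns allowed for the rise time; it is a back-of-the-envelope check, not the designers' analysis.

```python
def charging_current_amps(capacitance_f: float, swing_v: float, rise_time_s: float) -> float:
    """Average current needed to slew a capacitive load: I = C * dV / dt."""
    return capacitance_f * swing_v / rise_time_s

SWING_V = 6.0        # clock amplitude from the text
RISE_TIME_S = 6e-9   # upper bound on rise time from the text

loads_pf = {"RAM finstrate": 1500, "CPU chip": 300, "RAM chip": 30}
for name, pf in loads_pf.items():
    amps = charging_current_amps(pf * 1e-12, SWING_V, RISE_TIME_S)
    print(f"{name:14s} {pf:5d} pF -> {amps:.2f} A per phase")
```

A faster edge or a nonlinear ramp raises the peak above this average, which fits the 2 to 3A figure.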

Strip-line and microstrip techniques are used to distribute the
clocks to the other chips on the finstrate. Careful attention was
given to minimizing inductance because to achieve the clock
specifications under worst-case variations, there must be less
than 12 nH in series from the buffer to any chip. For comparison, a
single wire bond contributes about 4 nH, and a 2-inch loop of wire
is about 160 nH. Another inductance-reducing technique is the
use of multiple taps per clock phase on the processor chip.

Fig. 2 shows actual clock waveforms as distributed with an
earlier straightforward wiring approach, and as achieved with the
tuned high-frequency design currently in use.




Fig. 1. Photograph of 18-MHz clock buffer chip mounted in its
cavity on a finstrate.



Fig. 2. Clock waveforms using (a) an earlier straightforward
wiring approach and using (b) the present tuned high-frequency
design.

Fig. 4. Memory read timing.

RAMs as the X address (see Fig. 6). 

The mapper's CAM outputs drive the mapper ROM,
which generates chip selects and three bits of Y address
during P1A (φ1 address). The operating system must ensure
that logical-to-physical mapping assignments are unique
because these outputs are wired-OR lines. Simultaneous
matches in more than one mapper CAM can cause false
physical addresses. An output by any mapper CAM causes a
MY (my memory) condition to be sent to the control PLA
and the MPB interface. An SCTL will be given on P2A (φ2
address) if this MY condition occurs and the control PLA
determines that this is a memory operation.
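The need for unique mappings can be seen in a small behavioral model. The sketch below is illustrative only: the entry fields, widths, and names are assumptions, and the wired-OR lines are modeled as a bitwise OR of every matching entry's outputs.

```python
from dataclasses import dataclass

@dataclass
class MapperEntry:
    logical_block: int     # value compared against the incoming address field
    chip_select: int       # chip select lines driven when the entry matches
    y_high: int            # three mapped Y-address bits
    mapout: bool = False   # a mapped-out entry never matches

def map_address(entries, logical_block):
    """Return (my, chip_select, y_high) with all matching outputs OR-ed together."""
    my, chip_select, y_high = False, 0, 0
    for entry in entries:
        if not entry.mapout and entry.logical_block == logical_block:
            my = True
            chip_select |= entry.chip_select   # wired-OR behavior
            y_high |= entry.y_high
    return my, chip_select, y_high

entries = [MapperEntry(0x10, 0b01, 0b010), MapperEntry(0x11, 0b10, 0b101)]
print(map_address(entries, 0x10))   # one match: a clean physical address

entries.append(MapperEntry(0x10, 0b10, 0b101))   # duplicate logical block
print(map_address(entries, 0x10))   # two matches: outputs are merged
```

With the duplicate entry present, two chip selects are asserted at once and the Y bits are the OR of both entries, which corresponds to neither intended location.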

The chip select and mapped Y address go to the healer
and the MAB/CS section where they go immediately to the
RAM as the read CS and as the Y address on MAB 1 to 3
(MAB 0 is not used in the Y address). Bits 18 to 21 from the
original address were delayed and are now issued on MAB
4 to 7 to complete the Y address.

Available on the next P1A (state six), 39 bits of data from
memory go to the seven decoding trees in the Hamming
decoder. The decoder delays 32 of the bits for one clock
phase while it generates seven syndrome bits. The syndrome
bits are true during P2A when they are presented to
the data corrector along with the 32 delayed bits of read
data. If the syndrome bits are zero, the data corrector puts
the delayed read data, unchanged, on the data output bus
(DOB) during P1D.

If the syndrome bits are not zero, six of them provide a 
binary pointer for the data corrector to use to invert one of 
the read data bits. In the case of a single-bit error, the 
seventh syndrome bit (parity check of all 39 data bits) is a
one and the bad bit is corrected. Should the parity check be 
a zero while the others indicate there is an error, a double- 
bit error has occurred. The bit inverted by the data corrector 
is neither of the error bits, so a signal (DBE) about this is 
sent to the chip's MPB interface. Error detection and correc- 
tion are accomplished in 40 ns. 
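The decision logic described above (six pointer bits plus an overall parity bit) can be illustrated with a textbook extended Hamming code. The sketch below is not the chip's modified Hamming code, whose bit assignments the article does not give; the position layout and all names are assumptions chosen to produce a 39-bit word with the same single-correct, double-detect behavior.

```python
POSITIONS = 38                       # 32 data + 6 positional check bits, numbered 1..38
CHECK_POS = [1, 2, 4, 8, 16, 32]     # textbook layout: check bits at powers of two
DATA_POS = [p for p in range(1, POSITIONS + 1) if p not in CHECK_POS]

def encode(word):
    """Return 39 bits: index 0 is the overall parity bit, 1..38 are code positions."""
    code = [0] * (POSITIONS + 1)
    for i, p in enumerate(DATA_POS):
        code[p] = (word >> i) & 1
    for c in CHECK_POS:
        parity = 0
        for p in range(1, POSITIONS + 1):
            if p & c and p != c:
                parity ^= code[p]
        code[c] = parity
    code[0] = sum(code[1:]) & 1      # makes the parity of all 39 bits even
    return code

def decode(code):
    """Return (word, status) where status is 'ok', 'corrected', or 'double'."""
    code = list(code)
    pointer = 0
    for p in range(1, POSITIONS + 1):
        if code[p]:
            pointer ^= p             # six-bit binary pointer to a bad position
    overall = sum(code) & 1          # seventh syndrome bit: parity of all 39 bits
    if pointer == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[pointer] ^= 1           # single-bit error (pointer 0 is the parity bit itself)
        status = "corrected"
    else:
        status = "double"            # nonzero pointer but even parity: flag it, as DBE does
    word = 0
    for i, p in enumerate(DATA_POS):
        word |= code[p] << i
    return word, status

codeword = encode(0xDEADBEEF)
damaged = list(codeword)
damaged[17] ^= 1
print(decode(damaged))
```

Unlike the chip, which still inverts a bit on a double error and raises DBE, this sketch leaves the data untouched in that case.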

Data on the data output bus goes to the read data aligner,
and is also delayed one state to the fast-byte bus (FBB). In
the read data aligner, signals from the control PLA select
and right-justify bytes or half-words for nonword operations,
or pass through the whole word for word operations.
The read data aligner output goes to the data-in/data-out
(DINDO) bus, which is connected to the MPB interface. The
MPB interface now places the data, the second SCTL, and
the DBE signal (if present) on the MPB.

Write Memory Operations

Fig. 5 shows some timing for memory write operations. If
the memory operation is a write, the control PLA directs its
MPB interface to poll for a bus cycle during state five. This
obtains the bus cycle needed to put the write data into
memory. In state six, it repeats on the MPB the original
address, which was in a delay pipeline in the interface.
Most of the read memory cycle is then repeated to accomplish
the second half (completion) of the write.

Of course, several things are different from read operations.
First of all, in state seven, the read data is not placed
on the MPB, but rather the write data from the master
processor is latched by the interface. On P1A during state
eight, that data goes via the DINDO bus to the write data
aligner. For nonword operations, the write data aligner
merges the rightmost byte or half-word with the read data
on the fast-byte bus by substituting it as specified by the
address. In word operations, the read data is ignored and
the write data is passed through.

The output of the write data aligner goes to the Hamming
encoder. The Hamming encoder delays the data by one-half
state while it generates seven check bits from it. The check
bits are appended to the 32 data bits. Its P2A output is sent
to the encoded data bus. This bus goes to the memory data
bus section, which presents the 39 bits to the RAM as write
data during P1D in state nine. Also during P1D in state nine,
the PLA has the MAB/CS section repeat the chip selects to the
RAMs as write CS.
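The merge performed by the write data aligner for byte and half-word writes can be sketched as follows. The byte numbering (offset 0 is the most significant byte) and the alignment rule are assumptions made for illustration; the article does not define them.

```python
WORD_MASK = 0xFFFFFFFF

def merge_write(read_word, write_data, size, offset=0):
    """Substitute the rightmost `size` bytes of write_data into read_word.

    size is 1 (byte), 2 (half-word), or 4 (word); offset is the byte position
    within the word, counted from the most significant byte (an assumption).
    """
    if size not in (1, 2, 4) or offset % size or offset + size > 4:
        raise ValueError("unsupported size or misaligned offset")
    if size == 4:
        return write_data & WORD_MASK          # word write: read data is ignored
    width = 8 * size
    field = (1 << width) - 1
    shift = 32 - width - 8 * offset
    kept = read_word & ~(field << shift) & WORD_MASK
    return kept | ((write_data & field) << shift)

print(hex(merge_write(0x11223344, 0xAB, 1, 0)))
print(hex(merge_write(0x11223344, 0xCDEF, 2, 2)))
```

The merged word is what would then pass to the Hamming encoder, so the check bits always cover a full 32-bit word even when only one byte changes.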

Semaphore Operation 

A semaphore operation reads data from a memory location
and sends it to the master processor while a minus one
is written to that location. The master processor uses this to
obtain control of a process. The semaphore operation follows
the read operation with a few differences. First, the
output of the Hamming encoder is turned off, so the encoded
data bus and thus the write data on the memory data
bus is left precharged (all ones, which is minus one in the
signed integer format). Then during state five, the control
PLA has the MAB/CS section repeat the chip selects as write
CS. This makes the RAMs accept the minus one from the
memory data bus as write data for that location.
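In software terms the semaphore operation behaves like an indivisible read-and-set. The sketch below models it on a dictionary standing in for memory; the convention that any value other than minus one means "free" is a hypothetical usage, not something the article specifies.

```python
MINUS_ONE = 0xFFFFFFFF   # all ones: -1 as a 32-bit two's-complement integer

def semaphore_read(memory, address):
    """Return the old word and leave all ones behind, as one indivisible step."""
    old = memory[address]
    memory[address] = MINUS_ONE
    return old

def try_acquire(memory, address):
    """Hypothetical lock convention: the caller owns the semaphore if it was not already -1."""
    return semaphore_read(memory, address) != MINUS_ONE

memory = {0x100: 0}
print(try_acquire(memory, 0x100))   # first caller finds the old value 0
print(try_acquire(memory, 0x100))   # second caller finds -1 already stored
```

Because the hardware performs the read and the write within one memory operation, no other processor can slip in between them.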

Healer Operation 

In the healer, bits 18 to 29 of the address on the AND bus
are delayed one state, concatenated with the output of the
mapper ROM, and presented through the healer CAM address
bus (HCAB) to the healer's CAMs (HCAMs) and to a
pipeline that delays the bits for three states. The output of
the HCAMs goes to the healer control PLA. A match by an
HCAM causes a substitute memory location (an HRAM) to
dump its contents to the HRAM output bus while the input
to the Hamming decoder is switched from the memory data
bus to the HRAM output bus.

The healer has a significant effect on system reliability 
and availability. Up to 32 words per memory finstrate can 
have hard errors without either shutting down the system 
because of known memory problems (uncorrectable hard 
errors) or potential memory problems (hard single-bit errors 
increasing the likelihood of uncorrectable errors).

Healing on the Fly 

Healing on the fly is transparent to system performance. It
improves system integrity by healing known memory errors
as they are detected, without affecting the current transaction
or bus bandwidth as a correction and write-back
scheme would. It also provides a log of the error addresses,
which is useful in the repair or replacement of a card.

A nonzero set of syndrome bits sends a signal ERR to the 
healer. ERR causes the HCAM pointer to increment during 
the next state. As ERR comes true, the address in the HCAB 
delay pipeline is dumped on the healer's internal register 
access (HIRA) bus while the HCAM indicated by the HCAM 
pointer is set from the HIRA bus. When the HCAM pointer is 
incremented, the next address goes to the next HCAM, 
leaving the error address in the previous HCAM — thus the 
error is healed. 

Meanwhile, read data from memory is going to the HRAM
input bus and being set into the HRAM corresponding to
the HCAM indicated by the pointer. When the HCAM
pointer is incremented, the read data is similarly captured
in the HRAM, allowing the healer to have the same data in
its substitute memory as was in the bad memory location.
When the healer pointer count goes from 31 to 32, the healer
is filled, a status bit is set, and a message is sent to the
system.
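The capture-and-substitute behavior can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions: the syndrome check is replaced by a caller-supplied function, the captured word is treated as already corrected, and the class and function names are invented for illustration.

```python
class Healer:
    SIZE = 32   # healed words available per memory finstrate

    def __init__(self):
        self.hcam = []      # healed addresses, in capture order
        self.hram = []      # substitute words, parallel to hcam
        self.full = False   # the chip sets a status bit and notifies the system

    def lookup(self, address):
        """Return the substitute word if this address is healed, else None."""
        if address in self.hcam:
            return self.hram[self.hcam.index(address)]
        return None

    def capture(self, address, word):
        """Record an address whose read produced a nonzero syndrome (ERR)."""
        if self.full or address in self.hcam:
            return
        self.hcam.append(address)
        self.hram.append(word)
        self.full = len(self.hcam) == self.SIZE

def read(memory, healer, address, has_error):
    """Read a word, preferring the healer's substitute location when one exists."""
    substitute = healer.lookup(address)
    if substitute is not None:
        return substitute
    word = memory[address]
    if has_error(address):
        healer.capture(address, word)
    return word

memory = {a: a * 3 for a in range(64)}
healer = Healer()
print(read(memory, healer, 5, lambda a: a == 5))   # error seen, word captured
memory[5] = 0                                      # the bad cell later loses its data
print(read(memory, healer, 5, lambda a: a == 5))   # served from the substitute word
```

The list of captured addresses doubles as the error log mentioned above.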

Internal Register Access 

To manage the healer and mapper, the system must be
able to access their CAMs. It must also access the MC#
(memory controller chip number) and status registers to
turn on the system and the trace register for the system's
debug aid. This is done with a channel access to channel 0.
As previously mentioned, the address on the AND bus goes
to an internal register access (IRA) decode section. This
section checks the MC# field of the address (bits 6 to 9)
against the memory controller chip's MC# and signals the
control PLA if it matches. Memory controller chip IRA
operations are handled with data going directly between
the register and MPB interface. The main pathway is the
data time (P1A) of the AND bus. The AND bus is connected
via a multiplexer to the HIRA and mapper IRA (MIRA)
buses in the healer and mapper.

Refresh 

Since the NMOS RAM is dynamic, it must be refreshed.
This is accomplished by having synchronized refresh counters
on each memory controller chip. A refresh occurs every
16 bus cycles (32 states). The X address is changed for each
refresh, but the CS signal is given each time to all RAMs.
The MPB address time for the refresh cycle is normally
wasted, so it is used as the time when a memory controller
chip sends its messages to the CPU.
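The refresh rate can be checked against the RAM's requirement of 256 refresh cycles within 1 ms, listed in Table I of the RAM article. The sketch assumes one bus state per 18-MHz clock period and one row address refreshed per refresh cycle.

```python
CLOCK_HZ = 18e6            # system clock
STATES_PER_REFRESH = 32    # one refresh every 16 bus cycles
ROWS = 256                 # X addresses to cover
REFRESH_LIMIT_S = 1e-3     # RAM requirement: all rows within 1 ms

refresh_interval_s = STATES_PER_REFRESH / CLOCK_HZ
full_sweep_s = ROWS * refresh_interval_s
bus_share = 1 / 16         # fraction of bus cycles spent on refresh

print(f"refresh interval : {refresh_interval_s * 1e6:.2f} us")
print(f"full sweep       : {full_sweep_s * 1e6:.0f} us")
print(f"bus cycles used  : {bus_share:.2%}")
```

Under these assumptions a full sweep takes well under half of the 1-ms limit.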

Memory Management 

Also important to the system is being able to map and
unmap memory blocks or to heal and unheal HCAMs. Thus
each mapper CAM has a MAPOUT bit which disables that
CAM no matter what the other contents of the CAM are. Each
healer CAM has a HEALED bit, which when not set, disables
that healer CAM.

Self-Test

The self-test section of the memory controller chip is
almost as complex as microprocessors of six years ago.
Occupying 5% of the chip's die area and containing about
7000 transistors, it does a 99% confidence test on the internal
circuitry and the chip's MPB interface. Self-test on a
good chip completes in less than 1.5 ms.

Self-test simulates the memory controller chip's MPB
interface receiving addresses, data, and control signals 
from another processor. It does this by placing signals on 
the buses and control lines from the MPB interface to the 
internal registers, control PLA, and data manipulation sec- 
tion. Thus, the circuitry of the chip is tested as a functional 
unit rather than testing sections of circuitry separately. 

Fig. 7. Block diagram of self-test system incorporated onto
the memory controller chip.

Any failure halts the self-test. If no failure occurs, a
column-march test is done on the RAM addresses controlled
by each CAM in the mapper. Should Hamming decoding
detect any error or the data be incorrect (as in the case of
an addressing failure), the mapper CAM is loaded with a
MAPOUT condition as a message to the operating system
that the memory is not 100% good. The memory test takes
less than 500 ms to complete.
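The article does not give the details of the column-march test. As a general illustration of how march tests work, the sketch below implements MATS+, a standard march algorithm that catches stuck-at and many addressing faults. It is a substitute example, not the chip's test sequence, and the function names are invented.

```python
def mats_plus(read, write, size, width=32):
    """Run a MATS+ march test and return the sorted list of failing addresses."""
    zeros, ones = 0, (1 << width) - 1
    failures = set()
    for a in range(size):              # M0: write zeros everywhere
        write(a, zeros)
    for a in range(size):              # M1: ascending, read 0 then write 1
        if read(a) != zeros:
            failures.add(a)
        write(a, ones)
    for a in reversed(range(size)):    # M2: descending, read 1 then write 0
        if read(a) != ones:
            failures.add(a)
        write(a, zeros)
    return sorted(failures)

cells = [0] * 64

def read_cell(a):
    return cells[a]

def write_faulty(a, value):
    # Simulated defect: bit 0 of address 7 is stuck at zero.
    cells[a] = value & ~1 if a == 7 else value

print(mats_plus(read_cell, write_faulty, 64))
```

A block that reports any failing address is the kind of result that would lead to the MAPOUT condition described above.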

Self-test then allows handshakes with the system turn-on
procedure to test the MPB interface and its connection to
the memory processor bus.



A block diagram of the self-test system is shown in Fig. 7.
The core of this system is a 19-input, 68-output PLA with 
272 terms. Many of its outputs are sent to two 32-bit PLA- 
like pattern generators. The patterns from the test address 
generator are used as addresses or data placed on the AND 
bus, or as data compared to the AND bus. The patterns from 
the test memory data generator are used as data placed on or 
compared to the DINDO bus. 

Other terms control the counters and shift registers that
generate patterns for the test-address or test-memory-data
generators, and the control counters that sequence the PLA
through each state of each test block.

The memory control chip self-test has no branching or
subroutine capabilities. It is strictly a sequential machine. 
Thus, the main challenge in its design and implementation 
was to do the best test available while positioning the test 
blocks in a sequence that minimized terms in the PLA. This 
sometimes required inserting NOP test blocks, or repeating 
a test several times within a test block when once would 
have been enough. 

To check the self-test implementation, a software
simulator was built. Tied into the chip's software emulator, 
it helped check chip functionality. The emulator was also 
helpful in ironing out the complexities of healing on the fly 
in a pipelined system. 



128K-Bit NMOS Dynamic RAM with 
Redundancy 

by John K. Wheeler, John R. Spencer, Dale R. Beucler, and Charlie G. Kohlhardt



THE SEMICONDUCTOR random-access memory 
(RAM) chip is a basic building block of today's com- 
puter memory systems. Ideal memory chip charac- 
teristics in a high-performance multiprocessor system in- 
clude fast cycle times (large bandwidth), high number of 
bits per chip (density), low cost, and low power dissipation. 
A VLSI NMOS RAM was designed and built by Hewlett- 
Packard to optimize the above characteristics for HP's new 
32-bit computer system, the HP 9000. 

The RAM chip, whose layout is shown in Fig. 1, is composed
of a large dynamic memory array with supporting
peripheral circuitry on the left and bottom sides. The
number and complexity of the peripheral circuits are
minimized by the use of a four-transistor memory cell. The
performance and other characteristics of this RAM are
listed in Table I.

The memory array contains 128K four-transistor cells
organized to store 16K 8-bit words. Eight identical
16K-by-1-bit sections are arranged side by side. Each section has
256 rows and 64 columns. In addition, each section has



eight redundant rows located in the upper half, and two 
redundant columns placed in the center. 

The peripheral circuitry on the left side of Fig. 1 contains
the X and Y address receivers and drivers, the row decoders
and drivers, and the row redundancy circuitry. The X and Y
addresses are multiplexed on eight address pads. Each address
pad has an X address receiver, but only six of the eight
address pads have Y address receivers. The X address is
bused to the row decoders and row redundancy circuits via
16 true complement lines. The Y address is bused to the
column decoders and column redundancy circuits via 12
true complement lines.
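The address widths and line counts above follow directly from the array organization. The sketch below works through the arithmetic; the assignment of the upper bits of a word address to X and the lower bits to Y is an assumption made only to show a decode.

```python
SECTIONS, ROWS, COLUMNS = 8, 256, 64

total_bits = SECTIONS * ROWS * COLUMNS   # cells on the chip, excluding redundancy
x_bits = (ROWS - 1).bit_length()         # row (X) address width
y_bits = (COLUMNS - 1).bit_length()      # column (Y) address width
x_lines = 2 * x_bits                     # true and complement lines for X
y_lines = 2 * y_bits                     # true and complement lines for Y

def split_address(word_address):
    """Split a 14-bit word address into (row, column); the bit assignment is assumed."""
    return word_address >> y_bits, word_address & (COLUMNS - 1)

print(total_bits, x_bits, y_bits, x_lines, y_lines)
print(split_address(0x3FFF))
```

The six-bit Y address also explains why only six of the eight address pads need Y address receivers.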

The peripheral circuitry for each section along the bottom
of Fig. 1 contains a column decoder, an I/O multiplexer, a
sense amplifier, an output driver, a write data receiver, and
column redundancy. The wire-bond pads located along the
bottom include eight data I/O pads, several column-redundancy
testing and programming pads, and various
power supply and clock pads. Timing and control circuits
are found in the lower left corner.

Table I
128K-Bit RAM Performance and Characteristics

Technology:         NMOS, single-level polysilicon, 1.5-μm-wide
                    lines, 1-μm spaces, two-layer metal
Operating Modes:    Read, read/write, standby
System Features:    Synchronous timing (18-MHz system clocks)
                    Pipelined architecture
                    Semaphore operations
                    Multiplexed pads for chip select, data, and
                    address
Organization:       16K×8
Memory Cell:        Four-transistor dynamic
Cell Size:          10.25 μm by 20.5 μm
Chip Size:          6690 μm by 7580 μm
Redundancy:         8 rows and 16 columns, using electrically
                    programmed polysilicon links
Power Supplies:     -2V, 3.6V, 4.9V, 6.5V
I/O Levels:         Precharged bus scheme
Access Time:        165 ns
Cycle Time:         110 ns (includes read/write)
Power Dissipation:  450 mW, typical
Standby Power:      125 mW, typical (address receivers always
                    active)
Refresh:            256 cycles, 1 ms
Package:            Finstrate. Copper-core printed circuit
                    board with Teflon dielectric



Memory Cell 

This dynamic RAM is somewhat novel in that it uses a
four-transistor storage cell. Most MOS dynamic RAMs built
today use a one-transistor storage cell, and most MOS static
RAMs use either a six-transistor storage cell or a four-transistor
storage cell with high-resistance polysilicon load
resistors. The major characteristics of each design are summarized
in Table II.

The desired speed and performance of the system prevented
the use of the one-transistor storage cell. The high
power dissipation of the six-transistor static cell was prohibitive
and the four-transistor static cell could not be used
because HP's VLSI NMOS process (NMOS III) does not
incorporate high-resistance polysilicon.

The four-transistor dynamic cell shown in Fig. 2 provides
high speed, low power dissipation, and a cell size smaller
than most static cells, and is compatible with the NMOS III
process.

Functional Description 

The RAM chip uses a nonoverlapping, two-phase system
clock for synchronous operation. The clock period (one φ1
pulse and one φ2 pulse) is 55 ns. These pairs of φ1 and φ2
pulses are further organized as data cycles and address
cycles. Synchronization of the RAM chip with the 32-bit
computer system is done via the system-pop signal at
power-up.

The RAM chip provides three modes of operation: read, 



Table II
Storage Cell Characteristics

Type                      Speed    Cell Size    Static Power

Four-Transistor Dynamic   Fast     Medium       None
One-Transistor Dynamic    Slow     Small        None
Six-Transistor Static     Fast     Large        High
Four-Transistor Static    Fast     Large        Low




Fig. 1. Physical layout of 128K-bit NMOS dynamic RAM.

Fig. 2. Four-transistor dynamic storage cell.



read/write, and standby. See the memory system article on
page 14 for a detailed discussion of address and data cycles.
A step-by-step description of timing periods A through G
from the timing overview shown in Fig. 3 follows. Refer to
the block diagram of the data path in Fig. 4 to follow the
major internal events in the RAM.

Time A. The first X address X1 is received on the address
pads and latched into the X address receivers. No other
internal action takes place because, before this, the chip was
in standby mode.
Time B. All pads are precharged.
Time C. The first Y address Y1 and read chip select signal
RCS1 are received. RCS1 triggers the internal operation of
the chip, and the chip goes from standby into active mode.
Time D. All pads are precharged. The decoded X1 address
selects one of the 256 row lines to go high and all others
remain low. All cells connected to this row line begin driving
differential data on their respective precharged data/data'
pairs.






Fig. 3. Timing overview for operation
of 128K-bit NMOS dynamic
RAM chip. (DC = Don't Care).



Time E. The write chip select and write data signals WCS1
and WD1 are received on the pads. The decoded Y1 address
connects one of the 64 data/data' pairs in each section to the
sense amplifier through the I/O multiplexer. Differential
cell data is driven into the sense amplifier. The next X
address X2 is received, initiating pipelined operation.
Time F. All pads are precharged. The sense enable clock
signal SE isolates the sense amplifier from the I/O multiplexer.
The sense amplifier completes the sense operation
and sets up the output latch with the read data signal RD1.
WD1 is now differentially written back through the I/O
multiplexer to the currently addressed cell.
Time G. RD1 is driven from the output latch onto the data I/O
pads. Pipelined operation continues with the reception of
the next Y address Y2 and read chip select signal RCS2. This
completes the read/write operation to the cell at address X1,
Y1.

Fig. 4. Block diagram of data path for 128K-bit NMOS dynamic RAM. Only one of the eight
memory sections is shown.

Six clock pulses occur from when the X1 address appears
on the address pads to when RD1 is valid on the data pads.
Thus the access time for this read is 165 ns. Note that the X2,
Y2 address has been received and partially processed so
that the read data signal RD2 will be valid four clock pulses
after RD1, corresponding to a pipelined cycle time of 110 ns.

This timing scheme is different from that used for most 
commercially available dynamic RAMs, whose cycle time
is longer than their access time. 
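The access and cycle times follow from counting clock pulses. The sketch below assumes each φ1 or φ2 pulse occupies half of the 55-ns clock period; the function name is illustrative.

```python
CLOCK_PERIOD_NS = 55.0             # one phase-1 pulse plus one phase-2 pulse
PULSE_NS = CLOCK_PERIOD_NS / 2     # assumed: the two pulses split the period evenly

access_ns = 6 * PULSE_NS           # X address on the pads to read data valid
cycle_ns = 4 * PULSE_NS            # spacing between successive read data words

def data_valid_times(accesses):
    """Times (ns) at which read data becomes valid for back-to-back pipelined accesses."""
    return [access_ns + i * cycle_ns for i in range(accesses)]

print(access_ns, cycle_ns)
print(data_valid_times(3))
```

Each access still takes 165 ns from start to finish, but because the next address is accepted before the first data word appears, results arrive every 110 ns.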

Redundancy 

One method of increasing RAM yields, and thus reducing
chip cost, is the addition of redundant memory cells. These
redundant cells are used to replace defective cells, thereby
repairing some chips that would otherwise be rejected. By
adding extra rows and columns to the RAM array, defects of
various types and at various levels can be repaired.

To demonstrate the potential benefits of redundancy, a 
yield model was developed. Here, the good die (chips) per 
wafer are equal to 



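$$
\left(\frac{\text{Chips}}{\text{Wafer}}\right)
\left(\text{Probability of Zero Defects in the Uncorrectable Area}\right)
\left(\text{Probability That All Defects in the Correctable Area Can Be Repaired}\right)
$$

$$
= \left(\frac{\text{Chips}}{\text{Wafer}}\right)
\Bigl(\exp(-DUS)\Bigr)
\left(\sum_{n=0}^{R} \frac{(DC)^{n}}{n!}\,\exp(-DC)\,P^{n}\right)
$$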



where D = defect density, U = uncorrectable area, S = sensitivity
of uncorrectable area to defects, R = amount of redundancy,
C = correctable area, and P = probability that a
given defect is correctable.

A Monte Carlo analysis was done, treating each parameter
as a random variable with an assigned probability distribution.
From this analysis the optimum numbers of redundant
rows and columns for the 128K-bit RAM were determined
to be eight rows and sixteen columns. In addition,
the analysis indicated that a yield improvement greater 
than 4x could be achieved. 
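A single evaluation of such a yield model can be sketched as follows. The code assumes that defects in the correctable area are Poisson distributed and that at most R of them can be repaired, each with probability P. The defect density, sensitivity, and repair probability below are hypothetical values chosen only to exercise the formula, and the code does not reproduce the Monte Carlo analysis.

```python
import math

def good_die_fraction(D, U, S, C, P, R):
    """Fraction of good chips under a Poisson reading of the yield model."""
    zero_uncorrectable = math.exp(-D * U * S)
    all_repaired = sum(
        (D * C) ** n / math.factorial(n) * math.exp(-D * C) * P ** n
        for n in range(R + 1)
    )
    return zero_uncorrectable * all_repaired

CHIP_AREA_CM2 = 0.669 * 0.758    # chip size listed in Table I
C = 0.75 * CHIP_AREA_CM2         # 75% of the chip area is correctable
U = CHIP_AREA_CM2 - C
D, S, P = 8.0, 0.5, 0.9          # assumed: defects per cm^2, sensitivity, repair probability

without = good_die_fraction(D, U, S, C, P, R=0)
with_redundancy = good_die_fraction(D, U, S, C, P, R=8)
print(f"yield without redundancy: {without:.3f}")
print(f"yield with redundancy   : {with_redundancy:.3f}")
print(f"improvement             : {with_redundancy / without:.1f}x")
```

The improvement grows quickly with defect density, which is why redundancy pays off most on a large chip built in a new process.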

The four-transistor RAM architecture is well suited for
redundancy. In the case of the 128K-bit RAM, 75% of the
chip area is correctable. This correctable area includes not
only the memory array, but also the I/O multiplexer, row
drivers, and row and column decoders. Yield is limited by
defects in the remaining chip area. However, because circuits
along the periphery of this chip have a low percentage
of active device area and are relatively insensitive to
leakage-type defects, this area is higher yielding. By
using conservative design rules, the yield of this area is
increased even further.



Polysilicon Link Fusing
and Detection Circuit

The redundant rows and columns on HP's 128K-bit NMOS
dynamic RAM chip are programmed to replace defective rows or
columns by fusing polysilicon links on the chip. Special circuitry is
included on the chip to do this and to detect fused polysilicon
links. This circuitry is illustrated in Fig. 1.

When fusing polysilicon links, a special power supply,
VBLOW, is connected to the fusing circuit, the link is addressed,
and a voltage pulse is applied to the pulse pad. The resulting
current through the link and FET Q3 fuses the link open. During
normal operation, the pulse pad and VBLOW are driven to ground
by FETs Q1 and Q2 to disable the pulse circuitry.

To determine if a link is fused open or not, its resistance is
compared to a polysilicon reference resistor. In the worst case,
the link resistance must be only a factor of three different from the
reference for reliable detection. The reference resistor is designed
to be about five times the resistance of an unfused link,
regardless of process variations. This design provides higher link
fusing yield and greater reliability.

When power is first applied, POP (power on preset) becomes
high and its complement POP' is low. The resulting voltage at
node 1 is approximately equal to VT (threshold voltage) if the
link is intact, but is greater than VT if the link is open. The
currents through matched depletion FETs Q5 and Q6 depend
strongly on the difference of resistances of the link and the
reference resistor, and thus generate a corresponding voltage
differential at nodes F and F'.

After system power up, POP goes low and POP' goes high. The
differential voltage between F and F' is then amplified and
the circuit latches. Complementary outputs are then present at
nodes F and F'. Depletion capacitor C1 stabilizes the voltage
at node 1 during the transition from POP to POP' in case there is
some deadtime or overlap between these signals. The resistances
of the link and the reference become insignificant factors
once the circuit latches.

-Douglas F. DeBoer

Fig. 1. Link fusing and detection circuit used on the 128K-bit
NMOS RAM chip.

Fig. 5. Block diagram of redundancy system used on the
128K-bit RAM chip.

A block diagram of the redundancy system is shown in
Fig. 5. The row and column redundancies are similar in
design, but are separate circuits within the RAM. The operation
is as follows. An incoming row or column address is
compared to the addresses stored in the preprogrammed
address registers. If a match occurs on any of the address
comparators, the corresponding redundant row or column
is enabled. In addition, the deselect circuitry is activated,
which disables the nonredundant rows or columns. Redundant
rows and columns are identical to other rows and
columns. They share the same I/O data path, timing, storage
cell pitch, and layout. The only exception is that the address
decoders are replaced by address comparators for the redundant
rows and columns.
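The compare-and-deselect behavior can be expressed as a short lookup. The sketch below models only the row case, with eight spare rows as in this chip; the function name, the return format, and the use of None for an unprogrammed spare are illustrative assumptions.

```python
def select_row(x_address, programmed):
    """Choose the row that responds to an incoming X address.

    programmed holds one entry per redundant row: the defective row address
    fused into its register, or None if that spare is unprogrammed.
    """
    for spare, fused_address in enumerate(programmed):
        if fused_address is not None and fused_address == x_address:
            # Comparator match: enable the spare and deselect the normal rows.
            return ("redundant", spare)
    return ("normal", x_address)

programmed_rows = [None] * 8      # eight redundant rows per section
programmed_rows[0] = 37           # row 37 was found defective at wafer test

print(select_row(37, programmed_rows))
print(select_row(38, programmed_rows))
```

Column redundancy works the same way on the Y address, using its own separate set of registers and comparators.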

The programmable address registers contain polysilicon
links which are electrically programmed during wafer testing.
During normal operation, the link resistance is compared
to a reference polysilicon resistor through a comparator
circuit (see box). This circuit provides
both the true and complement outputs of the programmed
address to the address comparator circuits.

Because of additional delays through the address comparators
and disable circuitry, the disable signal becomes
true after the normal address decoding is complete. So that 
redundancy does not degrade chip performance, the dis- 
able signal deactivates the final stages of normal row and 
column selection rather than disabling their decoders. 



Finstrate: A New Concept in VLSI Packaging 

Finstrate combines a copper fin for heat conduction and
dissipation with a multilayer substrate for low-capacitance
interconnection between ICs.

by Arun K. Malhotra, Glen E. Leinbach, Jeffery J. Straw, and Guy R. Wagner



EVEN THOUGH HP'S NMOS III technology has low
power dissipation per gate, it also allows an IC designer
to pack more than half a million transistors
onto a single chip. The result is an average power density of
20 watts per square centimeter and power output up to 5
watts per chip. This degree of miniaturization also results
in circuits with a large number of interconnection pads and
high clock speeds. A 32-bit I/O processor chip, for example,
has 122 pads and operates at 18 MHz.

Early in the design of the chip set for a 32-bit VLSI computer
system, it became obvious that the speed, interconnect,
and cooling requirements of the system could not be
met by established packaging methods. An insulating
material with a low dielectric constant is necessary to
minimize line capacitance for high-speed operation. Fine
line traces are needed for the dense interconnect pattern
surrounding the ICs. The high power dissipation of the
chips results in unacceptable junction temperatures without
good heat dissipation.



The finstrate (fin-substrate) board was developed to meet
these needs. It has a solid copper core, uses Teflon for the
dielectric, and provides 0.125-mm-wide traces spaced
0.125 mm apart.

Fabrication 

An array of finstrates begins as a single copper sheet. This
sheet, roughly 10 mm thick, forms the heat dissipation path
and electrical backgate connection at the center of each
finstrate. Holes are drilled through this sheet to provide
openings for insulated electrical connections between the
outer layers of a completed finstrate. Following a surface
treatment operation, a sheet of Teflon and a copper foil are
laminated onto each side of the copper sheet. At this stage,
the Teflon fills the drilled holes mentioned above as shown
in Fig. 1a. The copper foil is converted to intermediate-layer
circuits by a print-and-etch process, a technique borrowed
from conventional printed circuit technology.

Teflon is a registered trademark of the DuPont Corporation.

Fig. 1. Cross sections during the finstrate fabrication sequence.
(a) After lamination of the intermediate copper foil
layer. (b) After definition of the intermediate layer circuit pattern
and subsequent outer copper foil layer lamination. (c)
After drilling and cavity milling steps. (d) Completed finstrate.

The finstrate panels grow thicker (Fig. 1b) as a second
layer of Teflon and another copper foil are laminated on
each side. Interconnect features between the intermediate
and outer copper foil circuit layers are defined next by a
selective etch process. Blind holes (vias) connect the
intermediate layer with the associated outer layer on
each side of the copper core.

Next, cavities are milled through the laminated layers to
the copper core at the locations where the ICs are to be
attached. A second drilling operation also performed at this
time serves two purposes. Relatively small bits drill
through the center of the Teflon material filling the large
holes through the copper core to form holes for plated-through
connections. The other holes drilled in this operation
contact the copper core to create outer-layer-to-core
connections. Fig. 1c demonstrates these features. Following
plating operations to build a conductive base coating
over the entire panel surface, circuits are defined by a
photoresist masking process which leaves the desired circuit
pattern exposed. Electroplated copper, nickel, and gold
increase the thickness of the exposed pattern. A further
selective electroplating step leaves a high-purity gold layer
on the edge connector fingers, wire bond pads, and chip
cavities. All copper foil remaining between traces is etched
away after the photoresist masks are stripped (the gold layer
protects the desired circuit pattern). A blanking operation
separates individual finstrates from the panel, and electrical
and visual tests complete the finstrate fabrication. A
completed finstrate, as shown in Fig. 1d, has a copper core,
two copper-foil interconnect layers on each side of the core,
Teflon as the dielectric, and gold-plated circuit patterns.

IC Assembly 

Assembly of a large hybrid circuit (124 x 181 mm) with 22
integrated circuits, 92 passive components, and over 800 
wire bonds is a challenge in itself. The refractory metalliza- 
tion used on the ICs and the finstrate's dielectric add even 
more constraints to the assembly process. 

Finstrates are first mechanically scrubbed and rinsed in
deionized water. This operation is essential for gold-wire
bonding on finstrates. Chip capacitors are surface mounted
using silver-filled epoxy. After a curing operation, a test is 
performed to check for epoxy bridging and to verify com- 
ponent values. Components are then coated with a noncon- 
ducting epoxy for protection from humidity. 

The ICs for the finstrate are picked up with a vacuum
collet and placed in the milled cavities which have
undergone an epoxy stamping operation. The silver-filled
epoxy, which is cured at 150°C, makes a good electrical and
thermal connection between the finstrate and the IC. Special
precautions are taken so that the IC's top surface is not
touched when handling the chips. This minimizes mechanical
damage and enhances the assembly yield.

The IC pads are electrically connected to the finstrate
with 38-µm-diameter gold wires. Placing 4-mm-long wires
on 0.16-mm centers using an automatic thermosonic wire
bonder requires tight controls over the bonding process.
Softening of the Teflon on the finstrates prevents the use of
bonding temperatures greater than 100°C. The use of
aluminum pads over silicon nitride on the IC and an extra
ball bond over the tail bond to the finstrate gives the best
results (see Fig. 2). After wire bonding, the ICs are coated
with a polymer for alpha particle protection. Stainless-steel
lids over the IC cavities are then attached to the finstrates
with a nonconducting epoxy to provide mechanical protection
for the ICs and the chip capacitors. After electrical tests
and a burn-in cycle, the finstrates are completed.

Fig. 2. Wire bonds from the IC to the finstrate use an extra
ball bond over the tail bond as shown.

Fig. 3. The HP 9000 Computer uses HP's new VLSI and
finstrate technologies to provide a desktop-size workstation
low enough in price to let professional personnel have their
own personal 32-bit mainframes.



81.5°C for the clock buffer chip, which was subsequently
experimentally verified to be correct within ±2°C. By
proper finstrate design it was found that under the
worst-case operating conditions of 4572 meters altitude and
55°C ambient air temperature, no processor chip exceeded the
maximum allowable junction temperature of 90°C.
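
The thermal analysis described in this article solves the
steady-state heat conduction equation with a finite-difference
approximation. The following Python sketch illustrates that
technique only. It is not the HP 9845 program: the grid size,
the Gauss-Seidel iteration, the fixed-temperature edges, the
copper-like conductivity, and the heat-source value are all
illustrative assumptions, and the airflow network the authors
added to their program is omitted.

```python
def solve_steady_heat(q, dx, k, t_edge, max_iters=20000, tol=1e-7):
    """Solve k*(d2T/dx2 + d2T/dy2) + q = 0 on a rectangular grid.

    q      -- 2-D list of heat generation per unit volume, W/m^3 (assumed)
    dx     -- grid spacing in metres (same in x and y)
    k      -- thermal conductivity, W/(m*K)
    t_edge -- temperature held on all four edges (Dirichlet boundary)
    """
    ny, nx = len(q), len(q[0])
    temp = [[t_edge] * nx for _ in range(ny)]
    for _ in range(max_iters):
        biggest_change = 0.0
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                # Five-point finite-difference stencil for the Poisson equation.
                new = 0.25 * (temp[j][i - 1] + temp[j][i + 1]
                              + temp[j - 1][i] + temp[j + 1][i]
                              + q[j][i] * dx * dx / k)
                biggest_change = max(biggest_change, abs(new - temp[j][i]))
                temp[j][i] = new
        if biggest_change < tol:
            break
    return temp


# Hypothetical example: one heat-dissipating cell in the centre of a plate
# whose edges sit at the 55 degC worst-case ambient quoted in the article.
N = 21
source = [[0.0] * N for _ in range(N)]
source[N // 2][N // 2] = 1.0e9          # W/m^3, illustrative value only
field = solve_steady_heat(source, dx=1.0e-3, k=400.0, t_edge=55.0)
hottest = max(max(row) for row in field)
```

In this sketch the hottest node is the source cell itself, and
the temperature falls off toward the fixed-temperature edges.
A model of a real finstrate would also need convective
boundary conditions for the cooling air.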

For the clock speed and signal rise-time requirements of 
the HP 9000 processing system, special consideration of the 
electrical performance of packaged components was re- 
quired. At the finstrate level, microstrip analysis of critical 
features was done. The choice of Teflon with its very low
dielectric constant of 2.1 significantly reduces capacitive
coupling when compared with other typical dielectric
materials. This generally allows increased speed for a given
output driver power level. Calculations also helped select
appropriate trace shapes and sizes for the various intercon- 
nect requirements. 
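
The effect of the dielectric constant can be illustrated with
two textbook relations: the capacitance of a fixed geometry
scales linearly with relative permittivity, and the
propagation delay of a trace fully surrounded by dielectric
scales with its square root. The sketch below is a rough
comparison, not the authors' microstrip analysis. The value
2.1 for Teflon is from the article; the comparison value of
4.5 for an epoxy-glass board is an assumption.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def delay_ps_per_mm(eps_r):
    """Propagation delay of a trace fully surrounded by a dielectric."""
    return math.sqrt(eps_r) / C0 * 1e9   # s/m converted to ps/mm


def capacitance_ratio(eps_r, eps_ref):
    """Capacitance of identical geometry scales linearly with permittivity."""
    return eps_r / eps_ref


TEFLON = 2.1        # from the article
EPOXY_GLASS = 4.5   # assumed typical value for comparison, not from the article

teflon_delay = delay_ps_per_mm(TEFLON)
epoxy_delay = delay_ps_per_mm(EPOXY_GLASS)
coupling = capacitance_ratio(TEFLON, EPOXY_GLASS)
```

For a microstrip trace, part of the field travels in air, so
the effective permittivity is lower than the bulk value and
these delays are upper bounds.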

Acknowledgments 

The authors would like to acknowledge the help from 
Walt Johnson and HP's Loveland Division printed circuit
shop — without their support, this program would not have 
succeeded. 



Memory/Processor Module

Three types of finstrates (see Fig. 5 on page 6) are used in
the HP 9000 Computer (Fig. 3). The CPU finstrate houses
the CPU chip and a clock buffer chip. The IOP finstrate
holds the I/O processor chip and a clock buffer chip and is
connected to a printed circuit board containing TTL
buffers. The 256K-byte memory finstrate contains twenty
128K-bit RAMs, a memory controller chip, and a clock
buffer chip. All finstrates are housed in an enclosure called
the Memory/Processor Module (see Fig. 4 on page 5).
Finstrates in this module are located on 11.5-mm centers and
are connected to a motherboard via edge connectors. A
one-hundred-pin bus connects the finstrates together. In
the center is the 32-bit memory processor bus (MPB) with
interlaced ground traces and active termination provided
by all inactive drivers. The system clock is routed to all
finstrates simultaneously along traces of matched length.
Self-test connections are included in the control signal
portion of the bus, and power supplies use the remaining bus
pins. A dc fan, chosen so that air velocity can be controlled 
as a function of ambient air temperature, is located on the 
end of the module to move cooling air across the finstrates. 

Finstrate and Module Design 

A considerable amount of time was spent on thermal 
analysis of the finstrates and the results were used to 
minimize the chip junction temperatures. A program that 
solves the steady-state Poisson heat conduction equation by
using a finite-difference approximation was written for an 
HP 9845 Computer. Since air is used to cool the finstrate, a 
nodal network representing that moving fluid was added to 
the program. Fig. 4 shows a calculated result for the RAM 
finstrate. The program predicted a junction temperature of

Fig. 4. Three-dimensional (a) and surface contour (b)
temperature plots calculated for the RAM finstrate at an
altitude of 4572 meters and an ambient temperature of 55°C.

NMOS-III Process Technology

by James M. Mikkelson, Fung-Sun Fei, Arun K. Malhotra, and S. Dana Seccombe



THE MAJOR TECHNOLOGICAL INNOVATION required
for the design and manufacture of the 32-bit
HP 9000 Computer System was the development of
NMOS III, a high-density, high-speed IC process. This
eight-mask, n-channel, silicon-gate process uses optical
lithography to print minimum features of 1.5-µm-wide
lines and 1.0-µm spaces on all critical levels. Both
enhancement and depletion devices are available. The
devices are fabricated with 40-nm-thick gate oxides and
shallow implanted sources and drains to reduce short-channel
effects. Major departures from conventional MOS processes
include external contacts to gates, drains, and sources, and
two layers of refractory metallization for interconnecting
devices.

Design Considerations 

Significantly improved circuit performance can be
obtained by reducing (scaling) the size of the geometrical
features of an integrated circuit. Scaling provides increased 
speed and reduced power consumption for a given electri- 
cal function, and at the same time, it allows the fabrication 
of a greater number of circuits on a given silicon chip area. 
This increased packing density reduces cost and improves 
reliability of an electronic system. 

But, to build a 32-bit VLSI computer system with the
feature sizes used in NMOS III, simple scaling of conven- 
tional circuits or processes is not practical because several 
physical effects become significant circuit and fabrication 
limitations. Some important device limitations, such as 
electron velocity saturation, fringing capacitance, sub- 
threshold current, substrate bias effects, and device varia- 
tions caused by fabrication tolerances, must be properly 
modeled in the design rules and circuit simulations before a 
32-bit VLSI chip can be designed successfully.

The importance of geometrical control is illustrated in
Fig. 1. The variation of threshold voltage as a function of
channel length is shown. The change in threshold voltage
of a 1.5-µm-channel-length device caused by a
photolithographic linewidth variation of ±0.25 µm is 0.10V,
which is about 10%. This change, in conjunction with a
channel-length change of 1.4 to 1, causes an output current
variation for the device of 1.6 to 1. These variations must
be included in the worst-case design of circuits.
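
The 1.4-to-1 channel-length figure follows directly from the
±0.25-µm tolerance on a 1.5-µm line. The sketch below
reproduces that arithmetic and then shows how a length change
and a threshold shift combine in an idealized square-law
transistor model. The square-law model, the 5-V gate drive,
and the placement of the 0.10-V spread around 1.0 V are
assumptions for illustration. The article's 1.6-to-1 figure
comes from the authors' own device models, which this sketch
does not attempt to reproduce.

```python
def length_ratio(nominal_um, tolerance_um):
    """Longest-to-shortest channel length for a +/- linewidth tolerance."""
    return (nominal_um + tolerance_um) / (nominal_um - tolerance_um)


def square_law_current_ratio(l_ratio, v_gate, vt_short, vt_long):
    """Short- to long-channel saturation current, assuming I ~ (Vg - Vt)^2 / L."""
    return l_ratio * ((v_gate - vt_short) / (v_gate - vt_long)) ** 2


ratio_l = length_ratio(1.5, 0.25)    # values from the article
# Assumed: 0.10-V threshold spread centred on 1.0 V, 5-V gate drive,
# with the shorter device having the lower threshold.
ratio_i = square_law_current_ratio(ratio_l, v_gate=5.0,
                                   vt_short=0.95, vt_long=1.05)
```

Even in this simplified model the current spread exceeds the
length spread, which is why both effects have to be carried
into worst-case simulations together.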

Circuit failure mechanisms caused by mobile ions, electron
injection into the gate oxide, and metal electromigration
or corrosion must be avoided. New process techniques
are required to minimize operating margin loss and speed
reduction caused by high-resistance interconnection
between individual devices. In addition to addressing these
problems, it was clear that to increase packing density
significantly, new interconnection and contact techniques had
to be developed.

Fig. 1. Effect of channel length on threshold voltage.

The failure mechanisms are avoided by using clean
processing techniques to prevent mobile ion contamination,
minimizing electron trapping in the oxide, limiting the
voltages applied to short-channel devices to prevent
threshold shifts caused by electron injection, and using
refractory metal to increase resistance to electromigration
and corrosion.

Because two layers of refractory metal are provided,
high-resistance polysilicon is not needed to interconnect
devices over any significant distance. Besides eliminating
the RC delays typically associated with high-resistance
polysilicon interconnections, the two layers of metal help
solve many of the topological problems associated with
interconnecting a very large number of devices. The ability
to run low-resistance interconnections in two directions
reduces the area of many circuits by a factor of two and
significantly simplifies the design process.

Fig. 2. By using external contact structures (a), the area
for a minimum device structure can be greatly reduced as
compared with using internal contact structures (b).

In a typical MOS process, large areas are devoted to
contacting the metallization to the gates, sources, and
drains of the devices. Special processing techniques
developed for NMOS III allow the use of external contacts as
illustrated in Fig. 2. By allowing direct metal contact to
the gate electrode over the gate oxide, and by allowing
similar contacts to the source and drain, an area savings of
up to 60% can be achieved.

Process Description 

Fabrication begins with high-resistivity (20 Ω·cm) p-type
substrates. After the growth of a 20-nm-thick thermal oxide
buffer layer, a 160-nm-thick layer of silicon nitride (Si3N4)
is deposited. The field oxide areas are patterned and the
nitride and oxide layers are etched. The exposed silicon is
anisotropically etched with potassium hydroxide and the
field regions are implanted to provide a high parasitic
threshold. A fully recessed field oxide is grown to a
thickness of 600 nm using the nitride layer as a local
oxidation mask.

The nitride and the oxide buffer layer are removed and all
exposed silicon is implanted with boron to a depth of 0.3
µm and an average doping of 3x10^^/cm^. The surface is
masked and areas are opened for the depletion load
implant. After these areas are implanted with phosphorus to a
depth of 0.15 µm and an average doping density of



Polysilicon Link Design 



The NMOS-III RAM required the development of an on-chip
polysilicon link for the redundancy circuitry. Correctly designing
this link required characterization of the physical fusing
mechanisms, thermal properties, and electrical behavior of
polysilicon. The resulting link can be electrically fused in a few
microseconds with less than 200 mW. This link is shown in Fig. 1.

The electrical and thermal properties of polysilicon vary greatly
with temperature as the link is heated to melting. For this reason,
an electrical analog of the link's thermal characteristics was
developed and simulated with the circuit analysis program
HPSPICE. This model accurately predicts the voltage-current
characteristics and thermal profiles for various links. Based on
simulations, a link geometry was chosen that requires only low
voltage and low power for fusing.

Fig. 1. Microphotograph of a polysilicon link before deposition
of CVD oxide and metallization. Cross-sectional view A-A
is shown in Fig. 2 with CVD oxide and metallization added.

Simulation was also valuable in determining thermal profiles at
the polysilicon-to-metal contacts found at each end of the link. It
was discovered that the polysilicon could melt and cause the fuse
gap to occur directly underneath the metal contact. This is highly
undesirable from both a reliability and manufacturing yield
standpoint. To control the temperature profile along the link, a
contacting scheme was developed as shown in Fig. 2. At one end,
the polysilicon makes contact with a diffused silicon region to
create a heat sink to the substrate. This cools the region under the
positive metal contact, forcing melting to occur only over field
oxide and not near the contacting metal.

The physical mechanism of link fusing is by migration of ionized
silicon atoms from the positive to the negative terminals of the link.
This is why only the positive end of the link is connected to a
diffusion for heat sinking. The fuse gap then occurs just beyond
the diffusion contact. This mechanism was investigated and
verified by cross-sectioning fused links and examining them with a
scanning electron microscope.

The link's high reliability is attributed to the IC's top layer of
insulating oxide softening and flowing into the fuse gap. The
integrity of the IC's passivation layer is unaffected by fusing since
the fusing power is so low. A guard ring was added during the
design as an additional safeguard.

Fig. 2. Cross section of polysilicon link structure showing
metal contacts and fusing location.

Automated Parameter Testing



With a process such as NMOS III that involves approximately
300 fabrication steps, it is essential that a performance
measurement link be established between the process and the
functionality of a chip such as the 32-bit CPU. This link should measure
important parameters affecting the performance of a VLSI chip
such as threshold voltage VT, drain-to-source current IDS versus
drain-to-source voltage VDS, and punchthrough voltage. The link
should also indicate to the process engineers the steps to be
controlled, including such parameters as oxide thicknesses,
linewidths, and sheet resistances. In the early stages of process
development, this link could be used as a process
characterization and monitor tool. If the devices to be measured are
included on wafers with the circuit chips (CPUs, RAMs, etc.), the
link could also be used to screen wafers before functional testing.
HP's Systems Technology Operation's implementation of this link is
an automatic parameter tester coupled with a test section on each
chip that enables 130 parameters to be tested in 2½ minutes.

The tester is composed of a variety of instruments, including
four stimulus measurement units (SMUs) that are operated in a
force-voltage/measure-current mode or force-current/measure-voltage
mode, a high-voltage (±100V) power supply, a capacitance
meter, an electrometer for low-current measurements, an
HP3455A Multimeter, and an HP3437A High-Speed Voltmeter.
All of these instruments are controlled by an HP 9845 Computer.
The instruments are multiplexed through a 58-pin test head to an
automatic prober with the wafer under test. Also included are an
HP 7906 Disc Drive for temporary data storage and an
RS-232-C/V.24-to-Factory-Data-Link interface for off-line data
manipulation. The SMUs have a slew rate of 0.5V/µs at the end of any
test pin. One reason for this remarkable performance is that all
signal wires are guarded and driven by separate circuits throughout
the test system. A block diagram is shown in Fig. 1.

In present production procedures, an operator loads a cassette
of wafers onto the automatic prober, inputs the device type to the
HP 9845, and the system takes over, automatically aligning, probing,
testing, analyzing data, and then transmitting the results.

Many different device and parameter test patterns can be
tested with this system, but one was specifically designed for the
complete process/circuit monitoring mentioned above. It has
evolved to include 14 gate-oxide FETs, including enhancement
and depletion mode devices with various channel widths and
lengths, two field-oxide FETs, devices for measuring lateral and
vertical open-circuits and short-circuits at the polysilicon, first
metal, and second metal layers, devices for measuring polysilicon
and diffusion sheet resistances, and ten capacitors. The
normal production flow involves sampling all test devices on five
chips per wafer under various bias conditions and getting a
real-time thermal printout of the results.

Fig. 1. Block diagram of the automated wafer parameter test
system used for the NMOS-III process.

Options include making wafer maps of particular parameters
and IDS versus VDS plots for specific FETs. This data is used to sort
the wafers for functional testing. A one-to-two-page summary of
all the data is also generated.

Acknowledgments

The test system was partially designed and constructed by
Tracy Ireland, and the graphics for the summary were
programmed by Richard Bettger.

-Fredrick P. LaMaster
-O. Douglas Fogg



1.5x10^^/cm^, the gate oxide is grown to a thickness of 40
nm. A cross section of the device structure at this point is
shown in Fig. 3a on page 30.

The gate oxide is patterned and etched to expose areas for
diffusions. LPCVD (low-pressure chemical vapor deposition)
polysilicon is deposited and doped with phosphorus
by a phosphine diffusion. During polysilicon doping, the
diffused regions, which are 0.6 µm deep, are also generated
by the diffusion of phosphorus through the polysilicon. The
structure after polysilicon doping is shown in Fig. 3b.

A layer of Si3N4 is deposited on the polysilicon for use as 
an oxidation mask at a later step. But first, the nitride is 
patterned and etched for use as an etch mask for the 
polysilicon. The polysilicon is etched and the source and 



drain regions are implanted through the overlapping gate
oxide with phosphorus to a depth of 0.3 µm. Buried
contacts are formed at the same time as the polysilicon pattern.
Fig. 3c shows the structure after source-drain implantation.
Next, the edges of the polysilicon features and the areas
over implanted and diffused regions are oxidized using the
Si3N4 layer as an oxidation mask to protect the polysilicon
features. The nitride is removed and a layer of
phosphorus-doped silicon dioxide is deposited by chemical
vapor deposition. Self-aligned contact holes with
minimum areas of 1.5 µm by 1.5 µm are defined and wet
etched through the deposited oxide, making use of the
different etch rates for phosphorus-doped oxide and thermal
oxide. The first layer of metal (400-nm-thick tungsten)
is deposited and patterned, leading to the structure shown
in Fig. 3d.

Another layer of oxide is deposited to act as the insulator
between the two layers of metal. After the definition and
etching of this oxide layer to form contact holes between the
metal layers, the 1.8-µm-thick second layer of metal
(LPCVD tungsten) is deposited, patterned, and etched with
typically 5-µm-wide lines and 3-µm spaces. The completed
device structure is shown in Fig. 3e.

Fig. 3. Cross sections of the device structure during
various steps of the NMOS-III process. (a) After gate
oxidation. (b) After polysilicon deposition and doping.
(c) After polysilicon patterning and source-drain
implantation. (d) After first-layer metal patterning.
(e) Completed device structure.



Two-Layer Refractory Metal IC Process 

by James P. Roland, Norman E. Hendrickson, Daniel D. Kessler, Donald E. Novy Jr., and David W. Quint



THE ABILITY TO FABRICATE 500,000 devices on a
single integrated circuit chip presents severe
topological puzzles in interconnecting them. This task is
further complicated by speed considerations that prohibit
the use of a relatively high-resistance polysilicon layer for
connections over any significant distance. Thus two layers
of low-resistance interconnect are necessary for the
practical design and operation of circuits using the NMOS-III
technology. These two metal layers are constrained by the
design rules shown in Table I.

The heavy emphasis on reducing device dimensions
(scaling) affects not only the width of the metal lines, but
also the material chosen and the processing used. Even
though the total current through a minimum-dimension
metal interconnect line is small in absolute terms, the
current density in these lines is on the order of one million
amperes per square centimeter because of their small
cross-sectional area. This high current density can lead to
electromigration failure.
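
A short worked calculation shows how a small current produces
such a high current density. The sketch below uses the
first-metal dimensions given elsewhere in this issue (a
1.5-µm-wide line of 400-nm-thick tungsten) and treats the
cross section as a simple rectangle, which is an assumption.
The function names are illustrative.

```python
UM_TO_CM = 1.0e-4   # one micrometre expressed in centimetres


def cross_section_cm2(width_um, thickness_um):
    """Rectangular cross-sectional area of an interconnect line."""
    return (width_um * UM_TO_CM) * (thickness_um * UM_TO_CM)


def current_for_density(j_a_per_cm2, width_um, thickness_um):
    """Current (A) that produces the given current density (A/cm^2)."""
    return j_a_per_cm2 * cross_section_cm2(width_um, thickness_um)


# 1.5-um-wide, 0.4-um-thick first-metal line at one million A/cm^2.
area = cross_section_cm2(1.5, 0.4)
current_ma = current_for_density(1.0e6, 1.5, 0.4) * 1e3
```

Under these assumptions the cross section is about 6e-9 cm²,
so roughly 6 mA is enough to reach one million amperes per
square centimeter.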

Because of its low resistivity and easy processing,
aluminum is the most commonly used metal for integrated
circuits. However, using typical values for current density
and the elevated operating temperature (up to 90°C) of
NMOS-III chips, electromigration calculations for
aluminum predict a mean-time-before-failure on the order
of weeks. The modeling of electromigration in tungsten is
incomplete, but initial tests place the electromigration
resistance of tungsten at about 1000 times that of
copper-doped aluminum. For aluminum to function reliably in the
NMOS-III process, its cross-sectional dimensions would
have to exceed the high-density design rule specifications
given in Table I.

Table I
NMOS-III Metal Interconnect Design Rules

Oxide                 450-nm-thick silicon dioxide
                      1.5-µm x 1.5-µm minimum contact area,
                      zero overlap to polysilicon, zero
                      overlap of first metal layer

First Metal Layer     1.5-µm-wide line/1.0-µm space
                      0.4 ohm/square sheet resistance

Intermediate Oxide    550-nm-thick silicon dioxide
                      1.5-µm x 2.0-µm minimum contact area,
                      zero overlap to first metal layer,
                      2.0-µm overlap of second metal layer
                      to via

Second Metal Layer    5.0-µm-wide line/3.0-µm space
                      0.04 ohm/square sheet resistance

Lifetime              Median lifetime ^10^ hours at S5t:

As an example of this effect, Fig. 1 shows a large,
copper-doped aluminum line connected to a small tungsten
line after both were subjected to a high current density at an
elevated temperature. The aluminum line is seen to have
developed voids and hillocks, and exhibits some signs of
melting. The tungsten line is unchanged. Because of
electromigration considerations, as well as its etchability and
chemical resistance, tungsten was chosen as the
interconnect metal for NMOS III.

Process Description

The choice of tungsten for integrated circuit
interconnections is a major deviation from established practice and
known mature technology. Furthermore, the dimensions
and tolerances required by the NMOS-III process prevent
the use of wet etching and demand that dry etching be
used. In addition, it has been repeatedly demonstrated that
worst-case situations leading to yield loss coincide with
surface variations of some sort. Therefore, care had to be
taken in selecting the metal interconnect process sequence
to control the shape of most features, avoid overhangs, and
keep the surface as planar as possible.

The oxide below the first metal layer is deposited by an
atmospheric CVD (chemical vapor deposition) process and
doped with phosphorus for gettering mobile ions and to
allow reflow.* The second oxide layer between the two
metal layers is applied in a similar fashion, but is not
reflowed. Oxide removal is accomplished in a plasma etcher
designed to have a high level of vertical ion bombardment,
which allows high and uniform etch rates. Results of
etching the second oxide layer are shown in Fig. 2.

The deposition of tungsten can be accomplished either by
a sputtering or a chemical vapor deposition process. Stress
and conductivity in the deposited films are important
parameters for a successful process. Because the first metal
layer makes direct contact with polycrystalline silicon, and
since the vias (vertical connections between interconnect
layers) are of the zero-overlap type, silicon material is
exposed to first-layer metal etch conditions at all vias. Under
all plasma conditions tested, silicon etches considerably
faster than tungsten. Therefore, the first metal layer is
deposited with a 30-nm-thick etch-stop material under the
400-nm-thick layer of tungsten. The second metal layer is
1.8-µm-thick tungsten deposited by a low-pressure CVD
process.

*The process in which the oxide is heated to a temperature where it begins to soften and thus
flows slightly to cover the underlying surface topography more evenly.

Fig. 1. Microphotograph showing electromigration occurring
in a wide aluminum line connected to an unaffected narrow
tungsten line carrying the same current.

Fig. 2. Microphotograph of a plasma-etched via between the
first and second metal layers.

Fig. 3. Microphotograph of NMOS-III first metal layer showing
contacts to the underlying polysilicon layer (vertical lines
in figure).

Etching of the tungsten layers is done in a parallel-plate
plasma etcher using a fluorine-based gas composition.
Worst-case conditions for metal shorts and opens were
determined, and margins for underetching and overetching
were established with regard to these limits. For instance,
overetching failures in the first metal layer are often caused
by a notch that occurs over certain features. Hence, the
margin for overetching the first metal layer is determined
by monitoring the area over this notch. A large-area test
mask, a defect-density test mask, and run wafer data and
analysis were used to define worst-case situations.
Scanning electron microscope photographs of typical
completed NMOS-III metal interconnects are shown in Fig. 3
and Fig. 4.

Fig. 4. Microphotograph showing coverage of second metal
layer over the first metal layer for the NMOS-III process.

Defect Control for Yield Improvement



Defect control is a comprehensive term at HP's Systems
Technology Operation (STO). It means reducing defects
introduced by faulty etching, particles, or other process-induced
problems. The defect control plan begins with understanding the
failure mechanisms on real VLSI chips (e.g., the 128K-bit RAM).
Production runs are analyzed by subdividing each run into dead
wafers, wafers with zones of defects, and random defects. Each
category is extensively studied to determine the exact physical
failure mechanism. Once the major process problems are found,
the process engineers develop an improved process and usually
establish a monitor for future control of this variable.

One typical example of comprehensive defect control relates to
particles in the metal deposition system. The problems were
traced to grinding chain mechanisms and a "big wind" effect
during venting of the vacuum chamber to atmospheric pressure.
Fig. 1 shows one of the many statistical control charts before and
after the machine was improved.
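
A statistical control chart of this kind can be sketched with
a standard c-chart, which sets limits at three standard
deviations around the mean count and treats the counts as
Poisson distributed. The sidebar does not say which chart type
or limits STO used, so the c-chart choice and all particle
counts below are hypothetical.

```python
import math


def c_chart_limits(counts):
    """Return (lower, centre, upper) 3-sigma limits for count data."""
    mean = sum(counts) / len(counts)
    sigma = math.sqrt(mean)            # Poisson assumption: variance = mean
    return max(0.0, mean - 3 * sigma), mean, mean + 3 * sigma


def out_of_control(counts, limits):
    """Indices of samples that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, c in enumerate(counts) if c < lower or c > upper]


baseline = [400, 420, 380, 410, 390]   # hypothetical particle counts
limits = c_chart_limits(baseline)
flagged = out_of_control([405, 470, 330, 395], limits)
```

With these made-up counts the limits come out at 340 and 460
around a centre line of 400, and the second and third new
samples are flagged for investigation.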

Another example is the control of crystal defects. A densely
packed 128K-bit RAM chip requires careful processing to avoid
refresh problems. Refresh errors were found on the early RAMs
and the problem was found to be junction leakage. Further
investigations by STO and HP Laboratories personnel showed that the
leakage was related to oxygen precipitation. Fig. 2 shows a RAM
that was angle lapped and Wright-etch decorated. The defects
delineated are oxygen precipitates. Working with silicon vendors,
the team solved the problem.

Acknowledgments

Thanks to Jim Barnes, Fred LaMaster, Rick Luebs, Tony Gad-

Fig. 2. Microphotograph of angle-lapped and Wright-etched
cross section of an early 128K-bit RAM wafer showing heavy
concentration of oxygen precipitates that contributed to
excess junction leakage.



[Control chart panel: PARTICLES: METAL SYSTEM #1, vertical scale 70 to 700]

Fig. 1. Control chart showing the reduction in particle count
over a period of time as a metal deposition process was
improved.

NMOS-III Photolithography

by Howard E. Abraham, Keith G. Bartlett, Gary L. Hillis, Mark Stolz, and Martin S. Wilson



FROM EARLY FEASIBILITY STUDIES it was clear
that the NMOS-III process would require
revolutionary photolithography methods to produce chips
in large volume. At that time, contemporary production
processes achieved a minimum feature size of around 4 µm
and level-to-level alignment within 0.75 µm. The
corresponding feature size for the proposed NMOS-III process
was to be 1 µm with ±0.25 µm alignment accuracy.

Early work using an optical aligner demonstrated that
optical lithography could meet the requirements. At about
the same time, the first step-and-repeat aligner with mass
production capability was introduced to the marketplace. It
was decided to use this machine for NMOS-III production.
Initial development work was done using conventional
photoresist processes, but it later became clear that the
standard process lacked the necessary control for some
levels because of exposure interactions with the substrate.
A new multilayer photoresist process was developed to
eliminate this problem.

Fig. 2. Two-layer resist process. (a) After application of
PMMA bottom layer and positive-resist top layer. (b) After
exposure and development of the top layer. (c) Flood
exposure of bottom layer by deep-UV light. (d) After
development of PMMA layer.



Exposure System 

The step-and-repeat optical aligner, shown schematically
in Fig. 1, is the heart of the NMOS-III photolithography
process. The light source consists of a HgXe bulb and an
optical system for collecting, collimating, and filtering
its output. Only radiation with a wavelength at the mercury
G-line (436 nm) is used. A sensor measures the light output
and provides information to a feedback system that adjusts
exposure time to compensate for bulb aging and other
variations. When the shutter is open, the light passes through
the reticle, a glass plate with a 10× chrome circuit
pattern. The reticle is precisely aligned to the optical
column by using a dedicated microscope. The high-resolution
reduction lens (numerical aperture = 0.28) projects a
reduced image of the reticle pattern onto the
photoresist-coated wafer. The wafer undergoing exposure is
prealigned off-axis by the automatic alignment system and then
moved into position under the lens. The entire wafer is exposed
serially, in most cases one die at a time, under the control of
the system computer. Positional feedback is derived from a
laser interferometer system. Because the reduction lens has
a shallow depth of focus (±2 µm), each image must be
individually focused by the automatic focus system, which
uses reflected infrared light. Once the mask stepping is
properly set up, the entire sequence proceeds automatically,
cassette-to-cassette, without operator intervention.
Because the system requires very uniform temperature for
precision alignment and image accuracy, the aligner is
enclosed in its own environmental chamber where temperature
is controlled to ±0.1°C.
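
As a rough check on the numbers in this paragraph, the sketch below applies the textbook Rayleigh scaling relations to the stated wavelength (436 nm) and numerical aperture (0.28), and shows one simple way an exposure-time feedback of the kind described could hold the dose constant. The Rayleigh formulas, the function names, and the dose and intensity values are assumptions made for illustration; they are not taken from the aligner's actual control software.

```python
# Illustrative sketch only: textbook scaling relations, not the aligner's software.

WAVELENGTH_UM = 0.436      # mercury G-line, from the text
NUMERICAL_APERTURE = 0.28  # reduction lens NA, from the text


def k1_for_feature(feature_um, wavelength_um=WAVELENGTH_UM, na=NUMERICAL_APERTURE):
    """Rayleigh process factor implied by a feature size: k1 = feature * NA / wavelength."""
    return feature_um * na / wavelength_um


def rayleigh_depth_of_focus(wavelength_um=WAVELENGTH_UM, na=NUMERICAL_APERTURE):
    """Rayleigh half-range depth of focus: wavelength / (2 * NA^2), in micrometers."""
    return wavelength_um / (2.0 * na ** 2)


def exposure_time(target_dose_mj_cm2, measured_intensity_mw_cm2):
    """Exposure time in seconds that holds dose constant (dose = intensity * time)."""
    if measured_intensity_mw_cm2 <= 0:
        raise ValueError("measured intensity must be positive")
    return target_dose_mj_cm2 / measured_intensity_mw_cm2


if __name__ == "__main__":
    print(f"k1 for a 1-um feature: {k1_for_feature(1.0):.2f}")
    print(f"Rayleigh depth of focus: +/-{rayleigh_depth_of_focus():.2f} um")
    # Hypothetical values: as the bulb ages, intensity drops and the time lengthens.
    for intensity in (200.0, 180.0, 160.0):
        print(f"{intensity:.0f} mW/cm^2 -> {exposure_time(100.0, intensity):.3f} s")
```

The Rayleigh estimate of the depth of focus comes out at the same order as the ±2 µm figure quoted above, which is why each image has to be focused individually.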

Production control of the system is accomplished
through an HP 9835 Desktop Computer. The 9835 is
interfaced to the system's computer, which contains the master
operating program. The production control program
consists of approximately 4000 lines of BASIC code which
provide a variety of data collecting and control functions.
Among the more important are:

■ The operator oversees the aligner's operation via the
  9835 Computer. The system provides prompts and
  instructions to minimize operating complexity.
■ Data files are maintained for each circuit and mask level.
  This data tells the system what stepping pattern to follow
  and what alignment offsets to apply.
■ The system gathers and retains data for production
  control. For example, setup and run times, run
  identification, and focus and exposure settings are
  automatically recorded. Prealignment and alignment
  performance data is retained for engineering analysis.
■ Using X and Y alignment data for each run level, the
  system automatically corrects wafer rotational error by
  applying appropriate offsets to the stepping pattern.
■ The system monitors optical column Z-axis movement to
  provide warning of potential poor focus caused by
  inadequate wafer flatness. This problem can arise if a
  particle becomes lodged under part of the wafer.
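
The rotation correction in the list above can be pictured as rotating the nominal stepping grid by the measured wafer angle. The sketch below is a minimal illustration, assuming the rotation is estimated from Y errors measured at two alignment targets separated along X. The function names, the two-target scheme, and the sample numbers are hypothetical and are not taken from the 9835 production control program.

```python
import math


def rotation_from_targets(dy_left_um, dy_right_um, separation_um):
    """Estimate wafer rotation (radians) from Y errors at two targets spaced along X."""
    return math.atan2(dy_right_um - dy_left_um, separation_um)


def corrected_step_positions(nominal_xy_um, theta_rad, dx_um=0.0, dy_um=0.0):
    """Rotate each nominal die position about the wafer center and add a translation."""
    c, s = math.cos(theta_rad), math.sin(theta_rad)
    return [(c * x - s * y + dx_um, s * x + c * y + dy_um) for x, y in nominal_xy_um]


if __name__ == "__main__":
    # Hypothetical 3 x 3 grid of die centers on a 10-mm pitch (values in micrometers).
    grid = [(ix * 10000.0, iy * 10000.0) for iy in (-1, 0, 1) for ix in (-1, 0, 1)]
    theta = rotation_from_targets(-0.5, 0.5, 60000.0)  # 1 um of skew over 60 mm
    for nominal, actual in zip(grid, corrected_step_positions(grid, theta)):
        print(nominal, "->", tuple(round(v, 3) for v in actual))
```

Even a rotation of a few microradians moves dice near the wafer edge by a noticeable fraction of the alignment budget, which is why the offsets are applied to the whole stepping pattern rather than to a single site.
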

The wafer alignment system is capable of aligning each
die individually. However, to improve throughput, the
wafers are aligned globally and the pattern is stepped
without aligning every die. Alignment performance with this
approach is ±0.20 µm (2σ). The site alignment system uses
a laser beam to illuminate a Fresnel-zone target on the
wafer. The incident light is focused by the target and
imaged through an optical system where detectors are used to
determine the target's location. Position information is fed
back to the controller which, with the aid of the
interferometer system, accomplishes the final alignment.

The reticles used in the system are produced by electron
beam lithography. Only electron beam generation can meet
the reticle linewidth and runout control requirements.
Also, the speed of electron beam generation is necessary
because of the tremendous complexity of the chips used in
HP's new 32-bit VLSI computer system. In some cases the
pattern is constructed of over 3 million rectangles. The
reticles are protected from dust contamination by pellicles
as described in the box on page 36.

Photoresist Process

The mask alignment system selectively exposes a thin
spun-on layer of photosensitive material (positive
photoresist). Areas exposed to light are rendered soluble in a
developer solution. After the photoresist pattern is
developed, it becomes a mask for subsequent etching or ion
implantation processes. It is extremely important that
photoresist linewidths be well controlled, within ±0.1 µm
for some NMOS-III mask levels. In the beginning, a severe
problem in linewidth control was encountered as a result of
light energy reflected from the underlying wafer surface.
Since the exposing light is monochromatic, standing wave
patterns exist in the resist because of interference between
incoming and reflected wave fronts. Because of the
standing wave, the amount of light energy coupled into the resist
(i.e., the exposure dose) is a strong function of film
thicknesses and substrate reflectivity. It was common to
find that photoresist lines passing over a step on the wafer
surface would be too wide on one side of the step and too
narrow on the other. To solve this problem, a two-layer
resist process was developed.
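
The thickness dependence described above follows from simple two-beam interference. The sketch below is an idealized illustration, assuming normal incidence, a nonabsorbing resist, and a single reflection from the substrate. The resist refractive index of 1.68, the reflection phase, and the reflectivity values are assumed for illustration and do not come from the article.

```python
import math

WAVELENGTH_NM = 436.0  # exposing wavelength, from the text
RESIST_INDEX = 1.68    # assumed refractive index of the resist (not from the text)


def standing_wave_period_nm(wavelength_nm=WAVELENGTH_NM, n=RESIST_INDEX):
    """Distance between intensity maxima in the resist: wavelength / (2 * n)."""
    return wavelength_nm / (2.0 * n)


def relative_intensity(height_nm, reflectivity, phase_rad=math.pi,
                       wavelength_nm=WAVELENGTH_NM, n=RESIST_INDEX):
    """Two-beam intensity at a height above the substrate (incident intensity = 1)."""
    round_trip_phase = 4.0 * math.pi * n * height_nm / wavelength_nm + phase_rad
    return 1.0 + reflectivity + 2.0 * math.sqrt(reflectivity) * math.cos(round_trip_phase)


def intensity_swing(reflectivity):
    """Ratio of maximum to minimum intensity in the standing wave."""
    root = math.sqrt(reflectivity)
    return ((1.0 + root) / (1.0 - root)) ** 2


if __name__ == "__main__":
    print(f"standing-wave period: {standing_wave_period_nm():.0f} nm")
    # Hypothetical reflectivities: a reflective substrate and a strongly damped one.
    for r in (0.30, 0.01):
        print(f"reflectivity {r:.2f}: max/min intensity = {intensity_swing(r):.2f}")
```

In this simplified picture the intensity ratio falls quickly as the reflected fraction is reduced, which is the effect the two-layer process aims for.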

Fig. 2 shows the process steps. Two photoresist materials
with very different properties are used. The bottom layer is
PMMA (polymethyl methacrylate). This layer planarizes
the wafer surface topography so that the top layer is uniform
in thickness. The top layer is a standard positive photoresist
which is sensitive to 436-nm-wavelength light
(near-ultraviolet). The bottom layer is not sensitive to the imaging
light and serves as a carrier for a dye which acts to absorb
the imaging light. The dye was selected for its strong
absorption at 436 nm as illustrated in Fig. 3. This
characteristic, along with the relatively long path for light to be
reflected by the wafer surface back to the top layer (two times
the PMMA thickness), means that only a small amount
(3%-by-weight) of dye must be added to the PMMA to
decouple the imaging exposure from the wafer surface.
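
The benefit of the double pass through the dyed PMMA can be estimated with the Beer-Lambert law. The sketch below is illustrative only: the absorption coefficient, layer thickness, and substrate reflectivity are hypothetical values chosen to show the scaling, not measured properties of the dye or the NMOS-III films.

```python
import math


def transmitted_fraction(absorption_per_um, path_um):
    """Beer-Lambert law: fraction of light remaining after a given path length."""
    return math.exp(-absorption_per_um * path_um)


def reflected_fraction_at_top_layer(absorption_per_um, pmma_thickness_um,
                                    substrate_reflectivity):
    """Fraction of imaging light that returns to the top resist layer.

    The light crosses the dyed PMMA twice (down and back up), so the
    absorbing path is two times the PMMA thickness.
    """
    double_pass = transmitted_fraction(absorption_per_um, 2.0 * pmma_thickness_um)
    return substrate_reflectivity * double_pass


if __name__ == "__main__":
    alpha = 1.0         # hypothetical dye absorption coefficient, 1/um
    thickness = 1.5     # hypothetical PMMA thickness, um
    reflectivity = 0.9  # hypothetical substrate reflectivity
    single = transmitted_fraction(alpha, thickness)
    returned = reflected_fraction_at_top_layer(alpha, thickness, reflectivity)
    print(f"single-pass transmission: {single:.3f}")
    print(f"fraction returned to top layer: {returned:.3f}")
```

Because the attenuation is exponential in path length, doubling the path squares the single-pass transmission, so a modest dye loading can suppress most of the reflected light.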

After the top layer is exposed, it is developed. The
patterned top layer then forms a mask for exposing the bottom
layer. The wafer is blanket exposed with deep-ultraviolet
light (wavelengths < 250 nm), which induces rupture of the
molecular chains in the PMMA layer, rendering the
exposed area soluble. The top resist layer is opaque to these
short wavelengths and therefore serves as an effective mask.
The dye in the PMMA layer bleaches and does not strongly
absorb the deep-UV light. Therefore, the thick layers of
PMMA can be completely exposed in depth. Fig. 3 also
shows the absorption and sensitivity characteristics of the
positive photoresist used. PMMA is sensitive only to
radiation wavelengths shorter than 250 nm. It has extremely
good contrast, but low sensitivity. This implies excellent
linewidth control, but long exposure time.

After the deep-UV exposure, the top layer of resist is
removed. MX-931 developer is used to remove the layer of
intermixed material that forms where the two resist
materials contact each other. The thickness of this interlayer is
minimized by the use of Kodak 809 photoresist rather than
alternative resists which crosslink more readily. After
interlayer removal, the PMMA layer is developed in MIBK
(methyl isobutyl ketone).

Fig. 4. Microphotograph of exposed and developed PMMA
pattern.

Fig. 4 shows the typical vertical sidewalls of the PMMA
lines produced by the process. Note the uniform linewidth
independent of underlying topography. Linewidth
uniformity and control are also evident in Fig. 5, which shows
FET breakdown voltage (BVDSS) measurements before and
after the two-layer resist process was used to define the gate.
The conventional process was characterized by lack of
linewidth control, which resulted in extremely variable
BVDSS and other FET parameters.

A major emphasis in the two-layer process development
has been manufacturability. The process involves no
complex film deposition or etching. Application of both resist
levels is accomplished cassette-to-cassette in an in-line
coat/bake system. No operator intervention is required.
Deep-UV exposure is done with a source that uses an
RF-excited electrodeless Hg bulb and quartz optics to achieve
wafer exposure about twenty times faster than the deep-UV
sources available during the early development phase of the
project. The deep-UV sources are mounted on
cassette-to-cassette wafer handling systems to minimize the need for
operator intervention.

Fig. 5. Comparison of the variation in breakdown voltage
BVDSS using a conventional resist process and the two-layer
resist process.

Emphasis was also placed on development of etches
compatible with PMMA. PMMA has poor plasma etch
resistance compared with many other resists and also has
marginal substrate adhesion, which is detrimental for wet
etching. By controlling plasma etching conditions, especially
wafer temperature, successful etches have been developed



Yield Improvement by Use of Pellicles 



Step-and-repeat photolithography has high yield potential
because the reticle can be made perfect. However, the reticle
must remain free of contamination in production or a serious
defect may be replicated on every die printed with that reticle. To
minimize or eliminate contamination, HP uses pellicles on all
reticles used in the NMOS-III process.

Fig. 1 shows a pellicle mounted on an NMOS-III reticle. It
consists of a thin nitrocellulose membrane stretched on a metal
frame. The frame is bonded to the reticle to encapsulate the
chrome pattern. One pellicle is used on each side of the reticle.
The volume between frame, membrane, and reticle is cleaned of
particles during assembly. In use, particles that fall on the
membrane are out of focus and thus are not imaged unless they
are large. Any large particles are easily detected and removed by
the aligner's operator using a very simple in-situ detection system.
The aligner's exposure source is turned on and the operator looks
for light that is scattered from any particle present on the pellicle.
During this observation, an optical filter is used to eliminate
confusion from particles too small to matter.

The pellicle works like an optical coating tuned to high
transmission at the operating wavelength (436 nm). To achieve
transmission > 98%, which is required for exposure uniformity,
the 865-nm membrane thickness must be uniform within ±10 nm
over the entire field.
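
The sensitivity of transmission to membrane thickness can be illustrated with the standard Airy formula for a free-standing thin film at normal incidence. This is a sketch under stated assumptions: a lossless membrane in air and a refractive index of 1.5 for the nitrocellulose, neither of which is given in the text.

```python
import math

WAVELENGTH_NM = 436.0  # operating wavelength, from the text
MEMBRANE_INDEX = 1.5   # assumed refractive index of the membrane (not from the text)


def membrane_transmission(thickness_nm, wavelength_nm=WAVELENGTH_NM, n=MEMBRANE_INDEX):
    """Transmission of a lossless free-standing film in air (Airy formula)."""
    surface_reflectance = ((n - 1.0) / (n + 1.0)) ** 2
    finesse_coeff = 4.0 * surface_reflectance / (1.0 - surface_reflectance) ** 2
    half_phase = 2.0 * math.pi * n * thickness_nm / wavelength_nm
    return 1.0 / (1.0 + finesse_coeff * math.sin(half_phase) ** 2)


if __name__ == "__main__":
    for thickness in (855.0, 865.0, 875.0):  # nominal thickness +/- 10 nm
        print(f"{thickness:.0f} nm -> T = {membrane_transmission(thickness):.4f}")
```

With these assumed values, 872 nm is a whole number of half-wavelengths inside the film, so a nominal thickness near 865 nm sits close to a transmission maximum, and the transmission falls off as the thickness departs from it.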

Success in eliminating repeating defects depends upon
contamination-free attachment of the pellicles to the reticle. There
can be no trapped particles larger than 3 µm in diameter. Thus,
great care is taken in the assembly operation to avoid particles. All
assembly is done in a bath of ionized laminar air flow. Reticles are
stripped of any organic residue (resist, pellicle adhesive, etc.) in a
mixture of sulphuric and chromic acids and then cleaned and
dried automatically using a brush, detergent, and a high-pressure
water jet. Pellicles are manually cleaned of particles by using a
miniature air jet. Inspection of the completed assembly is done
using a stereo microscope with illumination designed to highlight
any particles and deemphasize the background.

The step-and-repeat optical aligners have been modified to
accommodate the use of pellicle-protected reticles. To minimize
dust accumulation and contamination, the reticles are stored in
special filtered laminar-flow cabinets. Low-pressure airgun
blowoff is the only method used to clean the assemblies in
production.

-Robert Stutz 




Fig. 1. Photograph of pellicle used in the NMOS-III
photolithography process.

for patterning nitride (masks 1 and 4 in the NMOS-III
process), first metal (mask 6), and vias (mask 7). PMMA also
forms an effective ion implant stop for forming depletion
loads (mask 2). A wet etch process was developed to pattern
the gate oxide (mask 3) and contacts (mask 5). The etchant is
NH4F and water, buffered with citric acid. Other more
conventional buffers such as acetic acid were found to induce
unacceptable resist lifting at pattern edges.

and contributed to the development of the design and assembly
process for the finstrates used in HP's 32-bit VLSI computer
system. An author of a paper on computer-aided
reliability screening, he is married and the father
of two children, and lives in Loveland, Colorado.
He recently completed building his home, sings
bass in his church choir, and enjoys running and
playing volleyball.

Guy R. Wagner

Joining HP in 1981 with several years of experience
in designing PBX telephone systems, Guy Wagner
contributed to several parts of the Memory/Processor
Module design. He currently is working on VLSI
packaging technology. Born in Dubuque, Iowa, Guy attended Iowa
State University, earning a BSME degree in 1970
and an MSME degree in 1972. He is a member of
the IEEE and the International Electronic Packaging
Society and his work has resulted in one patent
related to printed circuit mounting. He is married,
has a daughter, and lives in Loveland, Colorado.
His interests include photography, flying, camping,
and radio-controlled model airplanes.



Fung-Sun Fei

Fung-Sun Fei received a PhD degree from the
University of Virginia in 1976 and joined HP shortly after.
Before assuming his current responsibilities as an
R&D project manager for defect density reduction
and reliability improvement, he managed the NMOS-III
metallization process project. He is married, has a son, and
lives in Fort Collins, Colorado. His interests include
hiking, swimming, and bicycling.

James M. Mikkelson

Graduating from the Massachusetts Institute of
Technology in 1972 with the BS, MS, and Electrical
Engineer degrees in electrical engineering, Jim
Mikkelson started at HP in early 1973 and worked on the
development of the NMOS-II process before becoming
an NMOS-III design and development project manager.
He is coauthor of several papers related to the NMOS-III
process (one of which received the Outstanding Paper
Award at the 1981 ISSC Conference). Jim is a
member of the IEEE and Sigma Xi. Born in Great
Falls, Montana, he is married, has a daughter, and
lives in Loveland, Colorado. Outside of work, he
enjoys sailing, cross-country skiing, bicycling,
woodworking, and touring on his motorcycle.

Norman E. Hendrickson

Norm Hendrickson graduated from the University
of Minnesota in 1978 with an MSEE degree. He
then joined HP and contributed to the development of
the NMOS-III process at HP's facility in Fort Collins,
Colorado. His outside interests include rock concerts,
bicycle racing, and skiing over big moguls. He lives
in Fort Collins.

Donald E. Novy, Jr.

Since joining HP in 1980, Don Novy has characterized
and supported the production of several of the
materials used in the NMOS-III metallization process.
He was born in Hoffman Estates, Illinois
and attended Purdue University where he earned a
BSEE degree in 1979 and an MSEE degree in 1980.
He is married, lives in Fort Collins, Colorado, and is
interested in amateur radio and designing and building
a home computer.

Daniel D. Kessler

An IC process engineer at HP's facility in Fort Collins,
Colorado, Dan Kessler worked on several tungsten
deposition processes for NMOS-III and yield
improvement for the 32-bit VLSI CPU chip. He joined
HP in 1980 after receiving an MSEE degree from the
Massachusetts Institute of Technology where he also
earned a BSEE degree in 1978. Born in Lansing, Michigan,
Dan is the author of a paper about a new photoconductive
device. He has a daughter, lives in Fort Collins,
and enjoys bicycling, camping, hiking, tennis, and
racquetball.

David W. Quint

Dave Quint has the BS and MS degrees in electrical
engineering awarded by the University of Wisconsin
and a PhDEE degree awarded by the Massachusetts
Institute of Technology. He started at HP in late 1979
and worked on NMOS-III as a process engineer. Dave is
the author of one paper and a coinventor for two
patents, one related to CVD tungsten deposition. A
native of Wisconsin, he served in the U.S. Air Force
before beginning his college education. He is
married, has three children, and lives in Fort
Collins, Colorado.

James P. Roland

Jim Roland received a BSEE degree from General
Motors Institute in 1970 and an MS degree from
Purdue University in 1971. He worked for a research
laboratory for two years and then returned to his
studies at Purdue and earned a PhD degree in 1977.
Jim then joined HP and worked on several aspects of
the NMOS-III metallization process. He is the author
of a paper about NMOS-III metallization and coinventor
of a patent on a liquid-level sensor design. He was born
in Detroit, Michigan, is married, and lives in Fort
Collins, Colorado. He is vice-president of the local
chapter of the Optimists and is interested in skiing,
bicycling, racquetball, and camping.

Gary L. Hillis

A native of Kokomo, Indiana, Gary Hillis attended
Purdue University and received a BSEE degree in
1978 and an MSEE degree in 1979. He then joined HP
and was part of the initial team establishing the
photolithographic facility at HP's plant in Fort Collins,
Colorado. Gary is a coauthor of a paper on photolithography
and coinventor of a patent related to the use of absorbing
dyes in photoresist. He is married and the father
of a daughter, and lives in Fort Collins. His interests
include sailing, skiing, and basketball.

Howard E. Abraham

With HP since 1969, Howard Abraham has worked
on a number of projects, including microwave device
design, NMOS-II process and product development,
and most recently, NMOS-III photolithographic
technology. His work has resulted in two patents related to
semiconductor device fabrication and a paper on
transistor device design. Before joining HP, he was
a flight control engineer for the Gemini and Surveyor
space missions. Howard has a BSEE degree
(1962) and an MSEE degree (1964) from the
University of Wisconsin at Madison, and has done
three years of graduate work at the University of
California at Berkeley. He is a member of the IEEE
and Sigma Xi. Married and the father of two children,
he lives in Loveland, Colorado. Outside of
work, he sings in a church choir, is interested in
solar energy applications and classic
automobiles, and enjoys motorcycling, computer
programming, and flying.

Mark Stolz

Born in Presque Isle, Maine, Mark Stolz studied
electrical engineering and computer science at the
University of Colorado and received a BS degree in
1976. He then worked on high-speed data communication
systems for a major defense contractor before joining
HP in 1979. He installed the environmental control
system for the NMOS-III process area and currently is
responsible for the step-and-repeat aligners used for
NMOS-III production. Mark is a member of the National
Society of Professional Engineers. Living in Fort
Collins, Colorado, he likes most outdoor activities,
but particularly enjoys skiing, backpacking, bicycling,
tennis, scuba diving, softball, and flying.

Keith G. Bartlett

Graduating from Colorado State University with a PhD
degree in physics in 1977, Keith Bartlett worked on
MOS memories for a major semiconductor manufacturer
before joining HP in 1979. He worked on photolithography
for the NMOS-III process and now is a production
engineering project manager. His contributions have
resulted in four patents and several papers related to
optics and IC processing. He is married, has a daughter,
and lives in Fort Collins, Colorado. When not training
bird dogs, he enjoys fly fishing.

Martin S. Wilson

Before coming to HP in 1973, Marty Wilson worked
on rocket and ramjet engine development. At HP,
he contributed to the design of the thermal printer
for the 9845 Computer, and currently is an NMOS-III
photolithography project manager. He received the BS
and MS degrees in aerospace engineering at the
University of Colorado in 1967 and an MBA degree at
California State University in 1972. Born in Denver,
Colorado, he now lives in Loveland, Colorado, is married,
and has two daughters. His interests include building
projects, backpacking, hunting, fishing, and
collecting Greek and Roman pottery.
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